I am working on code that takes a dataset and runs some algorithms on it.
The user uploads a dataset, then selects which algorithms will be run on it and creates a workflow like this:
workflow =
{0: {'dataset': 'some dataset'},
1: {'algorithm1': "parameters"},
2: {'algorithm2': "parameters"},
3: {'algorithm3': "parameters"}
}
This means I'll take workflow[0] as my dataset and run algorithm1 on it. Then I'll take its results and run algorithm2 on those results as my new dataset, then take the new results and run algorithm3 on them. It goes on like this until the last item, and there is no length limit for the workflow.
I am writing this in Python. Can you suggest some strategies about processing this workflow?
You want to run a pipeline on some dataset. That sounds like a reduce operation (fold in some languages). No need for anything complicated:
from functools import reduce
result = reduce(lambda data, step: algo_by_name(step[0])(step[1], data), workflow)
This assumes workflow looks like (text-oriented so you can load it with YAML/JSON):
workflow = ['data', ('algo0', {}), ('algo1', {'param': value}), … ]
And that your algorithms look like:
def algo0(p, data):
    …
    return output_data.filename
algo_by_name takes a name and gives you an algo function; for example:
def algo_by_name(name):
    return {'algo0': algo0, 'algo1': algo1}[name]
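Putting the pieces together, a minimal runnable sketch (the two algorithms here are hypothetical stand-ins; each takes (params, data) and returns the new data, matching the signature above):

from functools import reduce

def algo0(p, data):
    # hypothetical example algorithm: drop zeros from the data
    return [x for x in data if x != 0]

def algo1(p, data):
    # hypothetical example algorithm: scale every element by p['factor']
    return [x * p['factor'] for x in data]

def algo_by_name(name):
    return {'algo0': algo0, 'algo1': algo1}[name]

workflow = [[0, 1, 2, 3], ('algo0', {}), ('algo1', {'factor': 10})]

result = reduce(lambda data, step: algo_by_name(step[0])(step[1], data), workflow)
print(result)  # [10, 20, 30]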
(old edit: if you want a framework for writing pipelines, you could use Ruffus. It's like a make tool, but with progress support and pretty flow charts.)
If each algorithm works on each element on dataset, map() would be an elegant option:
dataset = workflow[0]
for algorithm in workflow[1:]:
    dataset = map(algorithm, dataset)
e.g. to square only the odd numbers (zeroing out the even ones), use:
>>> algo1=lambda x:0 if x%2==0 else x
>>> algo2=lambda x:x*x
>>> dataset=range(10)
>>> workflow=(dataset, algo1, algo2)
>>> for algo in workflow[1:]:
...     dataset=map(algo, dataset)
...
>>> dataset
[0, 1, 0, 9, 0, 25, 0, 49, 0, 81]
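Note that under Python 3, map() returns a lazy iterator rather than a list, so you would wrap the final result in list():

algo1 = lambda x: 0 if x % 2 == 0 else x
algo2 = lambda x: x * x
dataset = range(10)
for algo in (algo1, algo2):
    dataset = map(algo, dataset)
print(list(dataset))   # [0, 1, 0, 9, 0, 25, 0, 49, 0, 81]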
The way you want to do it seems sound to me, though you may need to post more information about what you are trying to accomplish.
One piece of advice: I would put the workflow structure in a list of tuples rather than a dictionary:
workflow = [ ('dataset', 'some dataset'),
('algorithm1', "parameters"),
('algorithm2', "parameters"),
('algorithm3', "parameters")]
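A minimal loop over that list-of-tuples structure might then look like this (ALGORITHMS is a hypothetical registry mapping algorithm names to functions that take (data, parameters)):

# Hypothetical registry; each function takes the current data and its parameters
ALGORITHMS = {
    'algorithm1': lambda data, params: data,
    'algorithm2': lambda data, params: data,
    'algorithm3': lambda data, params: data,
}

def run_workflow(workflow):
    _, data = workflow[0]                  # first entry holds the dataset
    for name, params in workflow[1:]:      # remaining entries are (name, parameters)
        data = ALGORITHMS[name](data, params)
    return data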
Define a Dataset class that tracks... data... for your set. Define methods in this class. Something like this:
class Dataset:
    # Some member fields here that define your data, and a constructor

    def algorithm1(self, param1, param2, param3):
        # Update member fields based on algorithm
        ...

    def algorithm2(self, param1, param2):
        # More updating/processing
        ...
Now, iterate over your "workflow" dict. For the first entry, simply instantiate your Dataset class.
myDataset = Dataset() # Whatever actual construction you need to do
For each subsequent entry...
Extract the key/value somehow (I'd recommend changing your workflow data structure if possible, dict is inconvenient here)
Parse the param string to a tuple of arguments (this step is up to you).
Assuming you now have the string algorithm and the tuple params for the current iteration...
getattr(myDataset, algorithm)(*params)
This will call the function on myDataset with the name specified by "algorithm" with the argument list contained in "params".
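As a rough sketch of that loop over the original dict-shaped workflow (the Dataset construction and parse_params are hypothetical placeholders for the steps described above):

def run_workflow(workflow):
    # workflow[0] wraps the raw dataset; the remaining entries name a method and its parameters
    myDataset = Dataset(workflow[0]['dataset'])   # whatever construction you actually need
    for key in sorted(k for k in workflow if k != 0):
        (algorithm, param_string), = workflow[key].items()
        params = parse_params(param_string)       # hypothetical helper: parameter string -> tuple
        getattr(myDataset, algorithm)(*params)    # dispatch by method name
    return myDataset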
Here is how I would do this (all code untested):
Step 1: You need to create the algorithms. The Dataset could look like this:
class Dataset(object):
    def __init__(self, dataset):
        self.dataset = dataset

    def __iter__(self):
        for x in self.dataset:
            yield x
Notice that you make an iterator out of it, so you iterate over it one item at a time. There's a reason for that, as you'll see later.
Another algorithm could look like this:
class Multiplier(object):
    def __init__(self, previous, multiplier):
        self.previous = previous
        self.multiplier = multiplier

    def __iter__(self):
        for x in self.previous:
            yield x * self.multiplier
Step 2
Your user would then need to make a chain of these somehow. If they had access to Python directly, you could just do this:
dataset = Dataset(range(100))
multiplier = Multiplier(dataset, 5)
and then get the results by:
for x in multiplier:
    print x
This asks the multiplier for one piece of data at a time, and the multiplier in turn asks the dataset. With a chain like this, one piece of data is handled at a time, which means you can handle huge amounts of data without using a lot of memory.
Step 3
Probably you want to specify the steps in some other way, for example a text file or a string (it sounds like this may be web-based?). Then you need a registry of the algorithms. The easiest way is to just create a module called "registry.py" like this:
algorithms = {}
Easy, eh? You would register a new algorithm like so:
from registry import algorithms
algorithms['dataset'] = Dataset
algorithms['multiplier'] = Multiplier
You'd also need a method that creates the chain from specifications in a text file or something. I'll leave that up to you. ;)
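As a starting point, a minimal sketch of such a chain builder (assuming, for illustration, the specification has already been loaded into a list of tuples where the first element is the registered name and the rest are extra constructor arguments):

from registry import algorithms

def build_chain(spec):
    # The first step is the data source; every later step wraps the previous one
    name, *args = spec[0]
    node = algorithms[name](*args)
    for name, *args in spec[1:]:
        node = algorithms[name](node, *args)
    return node

# Usage: iterating over the last node pulls data through the whole chain
# chain = build_chain([('dataset', range(100)), ('multiplier', 5)])
# for x in chain:
#     print(x)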
(I would probably use the Zope Component Architecture and make algorithms components and register them in the component registry. But that is all strictly speaking overkill).
Related
I am looking to build fairly detailed annotations for methods in a Python class. These are to be used in troubleshooting, documentation, tooltips for a user interface, etc. However, it's not clear how I can keep these annotations associated with the functions.
For context, this is a feature engineering class, so two example methods might be:
def create_feature_momentum(self):
    return self.data['mass'] * self.data['velocity']

def create_feature_kinetic_energy(self):
    return 0.5 * self.data['mass'] * self.data['velocity'].pow(2)
For example:
It'd be good to tell easily what core features were used in each engineered feature.
It'd be good to track arbitrary metadata about each method
It'd be good to embed non-string data as metadata about each function. Eg. some example calculations on sample dataframes.
So far I've been manually creating docstrings like:
def create_feature_kinetic_energy(self) -> pd.Series:
    '''Calculate the non-relativistic kinetic energy.
    Depends on: ['mass', 'velocity']
    Supports NaN Values: False
    Unit: Energy (J)
    Example:
        self.data = pd.DataFrame({'mass': [0, 1, 2], 'velocity': [0, 1, 2]})
        self.create_feature_kinetic_energy()
        >>> pd.Series([0, 0.5, 4])
    '''
    return 0.5 * self.data['mass'] * self.data['velocity'].pow(2)
And then I'm using regex to get the data about a function by inspecting the __doc__ attribute. However, is there a better place than __doc__ where I could store information about a function? In the example above, it's fairly easy to parse the Depends on list, but in my use case it'd be good to also embed some example data as dataframes somehow (and I think writing them as markdown in the docstring would be hard).
Any ideas?
I ended up writing a class as follows:
import pandas as pd

class ScubaDiver(pd.DataFrame):
    accessed = None

    def __getitem__(self, key):
        if self.accessed is None:
            self.accessed = set()
        self.accessed.add(key)
        return pd.Series(dtype=float)

    @property
    def columns(self):
        return list(self.accessed)
The way my code is written, I can do this:
sd = ScubaDiver()
foo(sd)
sd.columns
and sd.columns contains all the columns accessed by foo
Though this might not work in your codebase.
I also wrote this decorator:
def add_note(notes: dict):
    '''Adds k:v pairs to a .notes attribute on the decorated function.'''
    def _(f):
        if not hasattr(f, 'notes'):
            f.notes = {}
        f.notes |= notes  # dict merge (requires Python 3.9+)
        return f
    return _
You can use it as follows:
@add_note({'Units': 'J', 'Relativity': False})
def create_feature_kinetic_energy(self):
    return 0.5 * self.data['mass'] * self.data['velocity'].pow(2)
and then you can do:
create_feature_kinetic_energy.notes['Units'] # J
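Since the notes live on the function objects themselves, you can also sweep a whole class to build documentation or tooltips; a rough sketch (FeatureEngineering is a hypothetical stand-in for your feature class):

import inspect

def collect_notes(cls):
    # Map each feature method's name to its .notes dict (empty if it has none)
    return {
        name: getattr(func, 'notes', {})
        for name, func in inspect.getmembers(cls, inspect.isfunction)
        if name.startswith('create_feature_')
    }

# collect_notes(FeatureEngineering)
# -> {'create_feature_kinetic_energy': {'Units': 'J', 'Relativity': False}, ...}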
I've written what's effectively a parser for a large amount of sequential data chunks, and I need to write a number of functions to analyze the data chunks in various ways. The parser contains some useful functionality for me such as frequency of reading data into (previously-instantiated) objects, conditional filtering of the data, and when to stop reading the file.
I would like to write external analysis functions in separate modules, import the parser, and pass the analysis function into the parser to evaluate at the end of every data chunk read. In general, the analysis functions will require variables modified within the parser itself (i.e. the data chunk that was read), but it may need additional parameters from the module where it's defined.
Here's essentially what I would like to do for the parser:
def parse_chunk(dat_file, dat_obj1, dat_obj2, parse_arg1=None, fun=None, **fargs):
    # Process optional arguments to parser...
    with open(dat_file, 'r') as dat:
        # Parse chunk of dat_file based on parse_arg1 and store data in dat_obj1, dat_obj2, etc.
        dat_obj1.attr = parsed_data
        local_var1 = dat_obj1.some_method()
    # Call analysis function passed to parser
    if fun is not None:
        return fun(**fargs)
In another module, I would have something like:
from parsemod import parse_chunk

def main_script():
    # Preprocess data from other files
    dat_obj1 = ...
    dat_obj2 = ...
    script_var1 = ...
    # Parse data and analyze
    result = parse_chunk(dat_file, dat_obj1, dat_obj2, fun=eval_data,
                         dat_obj1=None, local_var1=None, foo=script_var1)

def eval_data(dat_obj1, local_var1, foo):
    # Analyze data
    ...
    return result
I've looked at similar questions such as this and this, but the issue here is that eval_data() has arguments which are modified or set in parse_chunk(), and since **fargs provides a dictionary, the variable names themselves are not in the namespace of parse_chunk(), so they aren't modified prior to calling eval_data().
I've thought about modifying the parser to just return all variables after every chunk read and call eval_data() from main_script(), but there are too many different possible variables needed for the different eval_data() functional forms, so this gets very clunky.
Here's another simplified example that's even more general:
def my_eval(fun, **kwargs):
    x = 6
    z = 1
    return fun(**kwargs)

def my_fun(x, y, z):
    return x + y + z

my_eval(my_fun, x=3, y=5, z=None)
I would like the result of my_eval() to be 12, as x gets overwritten from 3 to 6 and z gets set to 1. I looked into functools.partial but it didn't seem to work either.
To override kwargs you need to do
kwargs['variable'] = value # instead of just variable = value
in your case, in my_eval you need to do
kwargs['x'] = 6
kwargs['z'] = 1
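Putting that into the simplified example, the corrected my_eval looks like this and gives the result you wanted:

def my_eval(fun, **kwargs):
    # Overwrite the placeholders the caller passed in
    kwargs['x'] = 6
    kwargs['z'] = 1
    return fun(**kwargs)

def my_fun(x, y, z):
    return x + y + z

print(my_eval(my_fun, x=3, y=5, z=None))  # 12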
I am currently struggling to log a function in python in a clean way.
Assume I want to log a function which has an obscure list_of_numbers as argument.
def function_to_log(list_of_numbers):
# manipulate the values in list_of_numbers ...
return some_result_computed_from_list_of_numbers
When the values in list_of_numbers are manipulated in the above function, I would like to log that change, but not with the value from the obscure list_of_numbers; instead I want to log the value at the same index in a second list, list_with_names_for_log.
The thing that annoys me: now I also have to pass in list_with_names_for_log, which bloats the argument list of my function:
def function_to_log(list_of_numbers, list_with_names_for_log):
    # do some stuff like, change value on 3rd index:
    list_of_numbers[3] = 17.4
    log.info('You have changed {} to 17.4'.format(list_with_names_for_log[3]))
    # and so on ...
    return some_result_computed_from_list_of_numbers
I use several of these lists exclusively for logging in this function.
Does anybody have an idea how to make this a little cleaner?
Provided it makes sense for the data to be grouped, I'd group the name/data pairs in a structure. What you currently have is essentially "parallel lists", which are typically a smell unless you're in a language where they're your only option.
A simple dataclass can be introduced:
from dataclasses import dataclass

@dataclass
class NamedData:
    name: str
    data: int
Then:
def function_to_log(pairs):
    pairs[3].data = 17.4
    log.info('You have changed {} to 17.4'.format(pairs[3].name))
    # and so on ...
    return some_result_computed_from_list_of_numbers
As a sample of data:
pairs = [NamedData("Some Name", 1), NamedData("Some other name", 2)]
And if you have two separate lists, it's simple to adapt:
pairs = [NamedData(name, data) for name, data in zip(names, data_list)]
Only do this, though, if the name and data are both needed in most places where either list is used. Otherwise, you're just introducing overhead and bloat elsewhere to clean up a few calls.
I have a class whose members are lists of numbers built by accumulating values from experimental data, like
class MyClass:
    def __init__(self):
        self.container1 = []
        self.container2 = []
        ...

    def accumulate_from_dataset(self, dataset):
        for entry in dataset:
            self.container1.append(foo(entry))
            self.container2.append(bar(entry))
            ...

    def process_accumulated_data(self):
        '''Called when all the data is gathered.'''
        process1(self.container1)
        process2(self.container2)
        ...
Issue: it would be beneficial if I could convert all the lists into numpy arrays.
What I tried: the simple conversion
self.container1 = np.array(self.container1)
works. However, if I would like to convert more fields in one shot, like
lists_to_convert = [self.container1, self.container2, ...]

def converter(lists_to_convert):
    for list in lists_to_convert:
        list = np.array(list)
there is no effective change, since the references to the class members are passed by value and the loop variable is simply rebound to a new array.
I am thus wondering if there is a smart approach/workaround to handle the whole conversion process.
Any help appreciated
From The Pragmatic Programmer:
Ask yourself: "Does it have to be done this way? Does it have to be done at all?"
Maybe you should rethink your data structure? Maybe some dictionary or a simple list of lists would be easier to handle?
Note that in the example presented, container1 and container2 are just transformations on the initial dataset. It looks like a good place for list comprehension:
foo_data = [foo(d) for d in dataset]
# or even
foo_data = map(foo, dataset)
# or generator version
foo_data_iter = (foo(d) for d in dataset)
If you really want to operate on the instance variables as in the example, have a look at the getattr, setattr and hasattr built-in functions.
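For instance, a small sketch that rebinds named attributes in place using getattr/setattr (the attribute names are taken from the example above):

import numpy as np

def convert_to_arrays(obj, names=('container1', 'container2')):
    # Rebind each named attribute to a numpy array built from its current value
    for name in names:
        if hasattr(obj, name):
            setattr(obj, name, np.array(getattr(obj, name)))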
There isn't an easy way to do this because, as you say, Python passes references by value.
You could add a to_numpy method in your class:
import numpy as np

class MyClass:
    def __init__(self):
        self.container1 = []
        self.container2 = []
        ...

    def to_numpy(self, container):
        # Look up the attribute by name and rebind it to a numpy array
        data = getattr(self, container)
        setattr(self, container, np.array(data))
    ...
And then do something like:
obj = MyClass()
lists_to_convert = ["container1", "container2", ...]

def converter(lists_to_convert):
    for name in lists_to_convert:
        obj.to_numpy(name)
But it's not very pretty and this sort of code would normally make me take a step back and think about my design.
I have six possible situations which can map to four different results. Instead of using an extended if/else statement, I was wondering if it would be more Pythonic to use a dictionary to call the functions that I would otherwise call inside the if/else, as a replacement for a "switch" statement like one might use in C# or PHP.
My switch statement depends on two values which I'm using to build a tuple, which I'll in turn use as the key to the dictionary that will function as my "switch". I will be getting the values for the tuple from two other functions (database calls), which is why I have the example one() and zero() functions.
This is the code pattern I'm thinking of using, which I stumbled on while playing around in the Python shell:
def one():
    # Simulated database value
    return 1

def zero():
    return 0

def run():
    # Shows the correct function ran
    print "RUN"
    return 1

def walk():
    print "WALK"
    return 1

def main():
    switch_dictionary = {}
    # These are the values that I will want to use to decide
    # which functions to use
    switch_dictionary[(0, 0)] = run
    switch_dictionary[(1, 1)] = walk
    # These are the tuples that I will build from the database
    zero_tuple = (zero(), zero())
    one_tuple = (one(), one())
    # These actually run the functions. In practice I will simply
    # have the one tuple which is dependent on the database information
    # to run the function that I defined before
    switch_dictionary[zero_tuple]()
    switch_dictionary[one_tuple]()
I don't have the actual code written or I would post it here, as I would like to know if this method is considered a python best practice. I'm still a python learner in university, and if this is a method that's a bad habit, then I would like to kick it now before I get out into the real world.
Note, the result of executing the code above is as expected, simply "RUN" and "WALK".
edit
For those of you who are interested, this is how the relevant code turned out. It's being used on a google app engine application. You should find the code is considerably tidier than my rough example pattern. It works much better than my prior convoluted if/else tree.
def GetAssignedAgent(self):
    tPaypal = PaypalOrder()  # Parent class for this function
    tAgents = []
    Switch = {}
    # These are the different methods for the actions to take
    Switch[(0, 0)] = tPaypal.AssignNoAgent
    Switch[(0, 1)] = tPaypal.UseBackupAgents
    Switch[(0, 2)] = tPaypal.UseBackupAgents
    Switch[(1, 0)] = tPaypal.UseFullAgents
    Switch[(1, 1)] = tPaypal.UseFullAndBackupAgents
    Switch[(1, 2)] = tPaypal.UseFullAndBackupAgents
    Switch[(2, 0)] = tPaypal.UseFullAgents
    Switch[(2, 1)] = tPaypal.UseFullAgents
    Switch[(2, 2)] = tPaypal.UseFullAgents
    # I'm only interested in the number up to 2, which is why
    # I can consider the Switch dictionary to be all options available.
    # The "state" is the current status of the customer agent system
    tCurrentState = (tPaypal.GetNumberofAvailableAgents(),
                     tPaypal.GetNumberofBackupAgents())
    tAgents = Switch[tCurrentState]()
Consider this idiom instead:
>>> def run():
... print 'run'
...
>>> def walk():
... print 'walk'
...
>>> def talk():
... print 'talk'
>>> switch={'run':run,'walk':walk,'talk':talk}
>>> switch['run']()
run
I think it is a little more readable than the direction you are heading.
edit
And this works as well:
>>> switch={0:run,1:walk}
>>> switch[0]()
run
>>> switch[max(0,1)]()
walk
You can even use this idiom for a switch / default type structure:
>>> default_value=1
>>> try:
... switch[49]()
... except KeyError:
... switch[default_value]()
Or (the less readable, more terse):
>>> switch.get(49, switch[default_value])()
walk
edit 2
Same idiom, extended to your comment:
>>> def get_t1():
... return 0
...
>>> def get_t2():
... return 1
...
>>> switch={(get_t1(),get_t2()):run}
>>> switch
{(0, 1): <function run at 0x100492d70>}
Readability matters
It is a reasonably common Python practice to dispatch to functions based on a dictionary or sequence lookup.
Given your use of indices for lookup, a list of lists would also work:
switch_list = [[run, None], [None, walk]]
...
switch_list[zero_tuple[0]][zero_tuple[1]]()
What is considered most Pythonic is whatever maximizes clarity while meeting the other operational requirements. In your example, the lookup tuple doesn't have intrinsic meaning, so the operational intent is being lost behind a magic constant. Try to make sure the business logic doesn't get lost in your dispatch mechanism. Using meaningful names for the constants would likely help.
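For example, giving the states names keeps the business logic visible in the dispatch table (the names below are hypothetical):

# Hypothetical named states instead of bare (0, 0) / (1, 1) keys
NO_AGENTS = (0, 0)
ALL_AGENTS = (1, 1)

switch_dictionary = {
    NO_AGENTS: run,
    ALL_AGENTS: walk,
}

switch_dictionary[NO_AGENTS]()   # prints RUN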