Python - Passing functions to another function where the arguments may be modified

I've written what's effectively a parser for a large amount of sequential data chunks, and I need to write a number of functions to analyze the data chunks in various ways. The parser contains some functionality that is useful to me, such as controlling how often data is read into (previously instantiated) objects, conditional filtering of the data, and deciding when to stop reading the file.
I would like to write the analysis functions in separate modules, import the parser, and pass an analysis function into the parser to be evaluated at the end of every data chunk read. In general, an analysis function will require variables modified within the parser itself (i.e. the data chunk that was read), but it may also need additional parameters from the module where it's defined.
Here's essentially what I would like to do for the parser:
def parse_chunk(dat_file, dat_obj1, dat_obj2, parse_arg1=None, fun=None, **fargs):
    # Process optional arguments to parser...
    with open(dat_file, 'r') as dat:
        # Parse chunk of dat_file based on parse_arg1 and store data in dat_obj1, dat_obj2, etc.
        dat_obj1.attr = parsed_data
        local_var1 = dat_obj1.some_method()
    # Call analysis function passed to parser
    if fun is not None:
        return fun(**fargs)
In another module, I would have something like:
from parsemod import parse_chunk

def main_script():
    # Preprocess data from other files
    dat_obj1 = ...
    dat_obj2 = ...
    script_var1 = ...
    # Parse data and analyze
    result = parse_chunk(dat_file, dat_obj1, dat_obj2, fun=eval_data,
                         dat_obj1=None, local_var1=None, foo=script_var1)

def eval_data(dat_obj1, local_var1, foo):
    # Analyze data
    ...
    return result
I've looked at similar questions such as this and this, but the issue here is that eval_data() has arguments which are modified or set in parse_chunk(), and since **fargs provides a dictionary, the variable names themselves are not in the namespace of parse_chunk(), so they aren't modified prior to calling eval_data().
I've thought about modifying the parser to just return all variables after every chunk read and call eval_data() from main_script(), but there are too many different possible variables needed for the different eval_data() functional forms, so this gets very clunky.
Here's another simplified example that's even more general:
def my_eval(fun, **kwargs):
    x = 6
    z = 1
    return fun(**kwargs)

def my_fun(x, y, z):
    return x + y + z

my_eval(my_fun, x=3, y=5, z=None)
I would like the result of my_eval() to be 12, as x gets overwritten from 3 to 6 and z gets set to 1. I looked into functools.partial but it didn't seem to work either.

To override values in kwargs you need to assign into the dictionary itself:
kwargs['variable'] = value  # instead of just variable = value
In your case, in my_eval you need to do:
kwargs['x'] = 6
kwargs['z'] = 1
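Putting it together, a minimal sketch of my_eval with the overrides written into kwargs (using the toy functions from the question):

def my_eval(fun, **kwargs):
    # assign into the kwargs dict, not to plain local names,
    # so the overrides are visible when fun(**kwargs) is called
    kwargs['x'] = 6
    kwargs['z'] = 1
    return fun(**kwargs)

def my_fun(x, y, z):
    return x + y + z

my_eval(my_fun, x=3, y=5, z=None)   # returns 12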

Related

Python concurrent.futures

I have multiprocessing code, and each process has to analyse the same data differently.
I have implemented:
with concurrent.futures.ProcessPoolExecutor() as executor:
    res = executor.map(goal_fcn, p, [global_DataFrame], [global_String])
    for f in concurrent.futures.as_completed(res):
        fp = res
and function:
def goal_fcn(x, DataFrame, String):
    return heavy_calculation(x, DataFrame, String)
The problem is that goal_fcn is called only once, while it should be called multiple times.
In the debugger, I checked how the variable p looks, and it has multiple columns and rows. Inside goal_fcn, the variable x has only the first row - that looks good.
But the function is called only once. There is no error, the code just executes the next steps.
Even if I change the variable to p = [1, 3, 4, 5] (and adjust the code accordingly), goal_fcn is executed only once.
I have to use map() because keeping the order between input and output is required
map works like zip. It terminates once at least one input sequence is at its end. Your [global_DataFrame] and [global_String] lists have one element each, so that is where map ends.
There are two ways around this:
Use itertools.product. This is the equivalent of running "for all data frames, for all strings, for all p". Something like this:
def goal_fcn(x_DataFrame_String):
    x, DataFrame, String = x_DataFrame_String
    ...

executor.map(goal_fcn, itertools.product(p, [global_DataFrame], [global_String]))
Bind the fixed arguments instead of abusing the sequence arguments.
def goal_fcn(x, DataFrame, String):
    pass

bound = functools.partial(goal_fcn, DataFrame=global_DataFrame, String=global_String)
executor.map(bound, p)
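For illustration, here is a self-contained sketch of the second approach; the data values and the goal_fcn body are made-up placeholders standing in for the real DataFrame and heavy_calculation:

import concurrent.futures
import functools

def goal_fcn(x, DataFrame, String):
    # placeholder for heavy_calculation(x, DataFrame, String)
    return x * len(DataFrame) + len(String)

if __name__ == '__main__':
    global_DataFrame = [10, 20, 30]   # stand-in for the real DataFrame
    global_String = 'abc'
    p = [1, 2, 3, 4]

    bound = functools.partial(goal_fcn, DataFrame=global_DataFrame, String=global_String)
    with concurrent.futures.ProcessPoolExecutor() as executor:
        # one call per element of p; map preserves the input order
        results = list(executor.map(bound, p))
    print(results)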

Dynamically adding functions to array columns

I'm trying to dynamically add function calls to fill in array columns. I will be accessing the array millions of times so it needs to be quick.
I'm thinking of adding the call of a function into a dictionary by using a string variable:
numpy_array[row,column] = dict[key[index containing function call]]
The full scope of the code I'm working with is too large to post, so here is an equivalent simplistic example I've tried:
def hello(input):
    return input

dict1 = {}

# another function returns the name and ID values
name = 'hello'
ID = 0

dict1["hi"] = globals()[name](ID)
print(dict1)
but it literally calls the function immediately when using
globals()[name](ID)
instead of placing the call hello(0) itself into the dictionary.
I'm a bit out of my depth here.
What is the proper way to implement this?
Is there a more efficient way to do this than reading into a dictionary on every call of
numpy_array[row,column] = dict[key[index containing function call]]
as I will be accessing and updating it millions of times.
I don't know if the dictionary is called every time the array is written to or if the location of the column is already saved into cache.
Would appreciate the help.
edit
Ultimately what I'm trying to do is initialize some arrays, dictionaries, and values with a function
def initialize(*args):
    # create arrays and dictionaries
    # assign values to global and local variables, arrays, dictionaries
Each time the initialize() function is used it creates a new set of variables (names, values, etc.) that direct to a different function with a different set of variables.
I have a numpy array in which I want to store information about the function and associated values created by the initialize() function.
So, in other words, in the above example hello(0): the name of the function, its value, and some other things as set up within initialize().
What I'm trying to do is add the function with these settings to the numpy array as a new column before I run the main program.
So as another example. If I was setting up hello() (and hello() was a complex function) and when I used initialize() it might give me a value of 1 for hello(1).
Then if I use initialize again it might give me a value of 2 for hello(2).
If I used it one more time it might give the value 0 for the function goodbye(0).
So in this scenaro let's say I have an array
array[row,0] = stuff()
array[row,1] = things()
array[row,2] = more_stuff()
array[row,3] = more_things()
Now I want it to look like
array[row,0] = stuff()
array[row,1] = things()
array[row,2] = more_stuff()
array[row,3] = more_things()
array[row,4] = hello(1)
array[row,5] = hello(2)
array[row,6] = goodbye(0)
As a third example:
def function1():
    # do something

def function2():
    # do something

def function3():
    # do something

numpy_array(size)

initialize():
    # do some stuff
    # then add function1(23) to the next column in numpy_array

initialize():
    # do some stuff
    # then add function2(5) to the next column in numpy_array

initialize():
    # do some stuff
    # then add function3(50) to the next column in numpy_array
So, as you can see, I need to permanently append new columns to the array and fill the new columns with the function/value as directed by the initialize() function, without manual intervention.
So fundamentally I need to figure out how to assign syntax to an array column based upon a string value without activating the syntax on assignment.
edit #2
I guess my explanations weren't clear enough.
Here is another way to look at it.
I'm trying to dynamically assign functions to an additional column in a numpy array based upon the output of a function.
The functions added to the array column will be used to fill the array millions of times with data.
The functions added to the array can be various different functions with various different input values, and the number of functions added can vary.
I've tried assigning the functions to a dictionary using exec(), eval(), and globals(), but when using these during assignment it just immediately calls the functions instead of storing them.
numpy_array = np.array((1, 5))

def some_function():
    # do some stuff
    return 'other_function(15)'

# somehow add 'other_function(15)' to the array column:
numpy_array[1, 6] = other_function(15)
The functions returned by some_function() may or may not exist each time the program is run so the functions added to the array are also dynamic.
I'm not sure this is what the OP is after, but here is a way to make an indirection of functions by name:
def make_fun_dict():
    magic = 17
    def foo(x):
        return x + magic
    def bar(x):
        return 2 * x + 1
    def hello(x):
        return x**2
    return {k: f for k, f in locals().items() if hasattr(f, '__name__')}
mydict = make_fun_dict()

>>> mydict
{'foo': <function __main__.make_fun_dict.<locals>.foo(x)>,
 'bar': <function __main__.make_fun_dict.<locals>.bar(x)>,
 'hello': <function __main__.make_fun_dict.<locals>.hello(x)>}
>>> mydict['foo'](0)
17
Example usage:
import numpy as np

x = np.arange(5, dtype=int)
names = ['foo', 'bar', 'hello', 'foo', 'hello']

>>> np.array([mydict[name](v) for name, v in zip(names, x)])
array([17,  3,  4, 20, 16])
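If the goal is simply to store a ready-made call (function plus arguments) without executing it at assignment time, functools.partial is another option; a small sketch (not part of the answer above) using the hello function from the question:

import functools

def hello(input):
    return input

pending = {}
pending['hi'] = functools.partial(hello, 0)   # nothing is executed here yet

# ... later, when the value is actually needed:
result = pending['hi']()                      # now hello(0) runs and returns 0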

list of functions Python

I have a list of patterns:
patterns_trees = [response.css("#Header").xpath("//a/img/@src"),
                  response.css("#HEADER").xpath("//a/img/@src"),
                  response.xpath("//header//a/img/@src"),
                  response.xpath("//a[@href='" + response.url + '/' + "']/img/@src"),
                  response.xpath("//a[@href='/']/img/@src")
                  ]
After I traverse it and find the right pattern I have to send the pattern as an argument to a callback function
for pattern_tree in patterns_trees:
    ...
    pattern_response = scrapy.Request(..., ..., meta={"pattern_tree": pattern_tree.extract_first()})
By doing this I get the value of the regex not the pattern
THINGS I TRIED:
I tried isolating the patterns in a separate class, but I still have the problem that I cannot store them as patterns, only as values.
I tried to save them as strings, and maybe I can make that work, but:
What is the most efficient way of storing a list of functions?
UPDATE: Possible solution but too hardcoded and it's too problematic when I want to add more patterns:
def patter_0(response):
    return response.css("#Header").xpath("//a/img/@src")

def patter_1(response):
    return response.css("#HEADER").xpath("//a/img/@src")

# .....

class patternTrees:
    patterns = [patter_0, ..., patter_n]

    def length_patterns(self):
        return len(self.patterns)
If you're willing to consider reformatting your list of operations, then this is a somewhat neat solution. I've changed the list of operations to a list of tuples. Each tuple contains (a ref to) the appropriate function, and another tuple consisting of arguments.
It's fairly easy to add new operations to the list: just specify what function to use, and the appropriate arguments.
If you want to use the result from one operation as an argument in the next: You will have to return the value from execute() and process it in the for loop.
I've replaced the calls to response with strings that get printed, so that you can test it easily.
def response_css_ARG_xpath_ARG(args):
    return "response.css(\"%s\").xpath(\"%s\")" % (args[0], args[1])
    # return response.css(args[0]).xpath(args[1])

def response_xpath_ARG(arg):
    return "response.xpath(\"%s\")" % (arg)
    # return response.xpath(arg)

def execute(function, args):
    response = function(args)
    # do whatever with response
    return response

response_url = "https://whatever.com"

patterns_trees = [(response_css_ARG_xpath_ARG, ("#Header", "//a/img/@src")),
                  (response_css_ARG_xpath_ARG, ("#HEADER", "//a/img/@src")),
                  (response_xpath_ARG, ("//header//a/img/@src")),
                  (response_xpath_ARG, ("//a[@href='" + response_url + "/" + "']/img/@src")),
                  (response_xpath_ARG, ("//a[@href='/']/img/@src"))]

for pattern_tree in patterns_trees:
    print(execute(pattern_tree[0], pattern_tree[1]))
Note that execute() can be omitted, depending on whether you need to process the result or not. Without the executioner, you may just call the function directly from the loop:
for pattern_tree in patterns_trees:
    print(pattern_tree[0](pattern_tree[1]))
Not sure I understand what you're trying to do, but could you make your list a list of lambda functions like so:
patterns_trees = [
    lambda response: response.css("#Header").xpath("//a/img/@src"),
    ...
]
And then, in your loop:
for pattern_tree in patterns_trees:
    intermediate_response = scrapy.Request(...)  # without the meta kwarg
    pattern_response = pattern_tree(intermediate_response)
Or does leaving meta out have an impact on the response object?

Organizing code for Testing with Constants from a Configuration File

My application reads in many constants from a configuration file. These constants are then used at various places throughout the program. Here is an example:
import ConfigParser
config = ConfigParser.SafeConfigParser()
config.read('my.conf')
DB_HOST = config.get('DatabaseInfo', 'address')
DB_PORT_NUMBER = config.getint('DatabaseInfo', 'port_number')
DB_NUMBER = config.getint('DatabaseInfo', 'db_number')
IN_SERVICE = config.get('ServerInfo', 'in_service')
IN_DATA = config.get('ServerInfo', 'in_data')
etc...
I then have functions defined throughout my program that use these constants. Here is an example:
def connect_to_db():
    return get_connection(DB_HOST, DB_PORT_NUMBER, DB_NUMBER)
Sometimes, when I am testing or using the REPL, however, I don't want to use the values defined in the configuration file.
So I have instead defined the functions to accept the constants as parameters:
def connect_to_db(db_host, db_port_number, db_number):
    return get_connection(db_host, db_port_number, db_number)
And then when my program is run, the constants are all passed in to my main function, which needs to call other functions, which in turn call the functions and create the classes (all possibly in different modules) that need the constants:
def main(db_host, db_port_number, db_number, in_service, in_data):
    intermediate_function(
        db_host, db_port_number, db_number, other, parameters, here
    )
    other_intermediate_function(in_service, in_data, more, params, here)
    # etc...

def intermediate_function(db_host, db_port_number, db_number):
    # Other processing...
    c = connect_to_db(db_host, db_port_number, db_number)
    # Continued...

if __name__ == '__main__':
    main(DB_HOST, DB_PORT_NUMBER, DB_NUMBER, IN_SERVICE, IN_DATA)
The problem is that with too many constants, this quickly becomes unwieldy. And if I need to add another constant, I have several places to modify to ensure that my code doesn't break. This is a maintenance nightmare.
What is the proper Pythonic way of dealing with many different configuration constants, so that the code is still easy to modify and easy to test?
My idea is really simple: use optional keyword arguments.
Here is a little example, what I'm talking about:
# collect the constants into a dictionary,
# with the proper key names
d = dict(a=1, b=2, c=3)

# create the functions where you want to use the constants,
# use the same key names as in the dictionary, and also
# add '**kwargs' at the end of the argument list
def func1(a, b, d=4, **kwargs):
    print a, b, d

def func2(c, f=5, **kwargs):
    print c, f

# now, any time you use the original dictionary
# as an argument for one of these functions, the
# function will pick out only those keywords
# that it actually uses
func1(**d)
# 1 2 4
func2(**d)
# 3 5
This idea allows you to modify the list of constants in only one place: your original dictionary.
Here it is, ported back to your idea.
This is your configuration.py:
# Your parsing, reading and storing functions go here...

# Now create your dictionary
constants = dict(
    host = DB_HOST,
    pnum = DB_PORT_NUMBER,
    num = DB_NUMBER,
    serv = IN_SERVICE,
    data = IN_DATA
)
Here is your other_file.py:
import configuration as cf

def main(host, pnum, num, serv, data, **kwargs):
    intermediate_function(
        host, pnum, num, 'other', 'params', 'here'
    )

def intermediate_function(host, pnum, num, *args):
    pass

# Now, you can call this function with your
# dictionary as keyword arguments
if __name__ == '__main__':
    main(**cf.constants)
Although this is working, I do not recommend this solution!
Your code is going to be harder to maintain, since every time you call one of those functions and pass in your dictionary of constants, you will only see "one" argument: the dictionary itself, which is not very descriptive. So I believe you should think about a better architecture for your code, where you use more deterministic functions (returning "real" values) and chain them together, so you don't have to pass all those constants around all the time. But this is just my opinion :)
EDIT:
If my solution mentioned above suits you, I also have a better idea on how to store and parse your configuration file and turn it automatically into a dictionary: use JSON instead of a plain .cfg file.
This will be your conf.json:
{
"host" : "DB_HOST",
"pnum" : "DB_PORT_NUMBER",
"num" : "DB_NUMBER",
"serv" : "IN_SERVICE",
"data" : "IN_DATA"
}
And your configuration.py will look like this:
import json

with open('conf.json') as conf:
    # the JSON parser will convert it to a dictionary
    constants = json.load(conf)
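As a quick usage sketch (note that the string values in conf.json above are placeholders; a real file would hold the actual host, port number, and so on), the calling module from earlier stays the same:

import configuration as cf

# cf.constants is now the dictionary parsed from conf.json,
# and it is unpacked into the main() defined earlier exactly as before
if __name__ == '__main__':
    main(**cf.constants)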

Processing a simple workflow in Python

I am working on a code which takes a dataset and runs some algorithms on it.
User uploads a dataset, and then selects which algorithms will be run on this dataset and creates a workflow like this:
workflow = {0: {'dataset': 'some dataset'},
            1: {'algorithm1': "parameters"},
            2: {'algorithm2': "parameters"},
            3: {'algorithm3': "parameters"}
            }
Which means I'll take workflow[0] as my dataset, and I will run algorithm1 on it. Then, I will take its results and run algorithm2 on those results as my new dataset. And I will take the new results and run algorithm3 on them. It goes on like this until the last item, and there is no length limit for this workflow.
I am writing this in Python. Can you suggest some strategies about processing this workflow?
You want to run a pipeline on some dataset. That sounds like a reduce operation (fold in some languages). No need for anything complicated:
result = reduce(lambda data, (aname, p): algo_by_name(aname)(p, data), workflow)
This assumes workflow looks like (text-oriented so you can load it with YAML/JSON):
workflow = ['data', ('algo0', {}), ('algo1', {'param': value}), … ]
And that your algorithms look like:
def algo0(p, data):
    ...
    return output_data.filename
algo_by_name takes a name and gives you an algo function; for example:
def algo_by_name(name):
    return {'algo0': algo0, 'algo1': algo1, }[name]
(old edit: if you want a framework for writing pipelines, you could use Ruffus. It's like a make tool, but with progress support and pretty flow charts.)
If each algorithm works on each element on dataset, map() would be an elegant option:
dataset = workflow[0]
for algorithm in workflow[1:]:
    dataset = map(algorithm, dataset)
e.g. for the squares of odd numbers only, use:
>>> algo1 = lambda x: 0 if x % 2 == 0 else x
>>> algo2 = lambda x: x * x
>>> dataset = range(10)
>>> workflow = (dataset, algo1, algo2)
>>> for algo in workflow[1:]:
...     dataset = map(algo, dataset)
>>> dataset
[0, 1, 0, 9, 0, 25, 0, 49, 0, 81]
The way you want to do it seems sound to me, or else you need to post more information about what you are trying to accomplish.
One piece of advice: I would put the workflow structure in a list of tuples rather than a dictionary:
workflow = [('dataset', 'some dataset'),
            ('algorithm1', "parameters"),
            ('algorithm2', "parameters"),
            ('algorithm3', "parameters")]
Define a Dataset class that tracks... data... for your set. Define methods in this class. Something like this:
class Dataset:
    # Some member fields here that define your data, and a constructor

    def algorithm1(self, param1, param2, param3):
        # Update member fields based on algorithm
        ...

    def algorithm2(self, param1, param2):
        # More updating/processing
        ...
Now, iterate over your "workflow" dict. For the first entry, simply instantiate your Dataset class.
myDataset = Dataset() # Whatever actual construction you need to do
For each subsequent entry:
- Extract the key/value somehow (I'd recommend changing your workflow data structure if possible; a dict is inconvenient here).
- Parse the param string into a tuple of arguments (this step is up to you).
Assuming you now have the string algorithm and the tuple params for the current iteration...
getattr(myDataset, algorithm)(*params)
This will call the method of myDataset whose name is given by "algorithm", with the argument list contained in "params".
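A minimal sketch of that dispatch loop, assuming the workflow entries have already been turned into (name, params) pairs (the method names scale and clip are made up for illustration):

class Dataset:
    def __init__(self, data):
        self.data = list(data)

    def scale(self, factor):
        # update member fields based on the algorithm
        self.data = [x * factor for x in self.data]

    def clip(self, low, high):
        self.data = [min(max(x, low), high) for x in self.data]

myDataset = Dataset(range(10))

# (method name, parsed parameter tuple) pairs
steps = [('scale', (3,)), ('clip', (0, 12))]

for algorithm, params in steps:
    getattr(myDataset, algorithm)(*params)   # dispatch by name

print(myDataset.data)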
Here is how I would do this (all code untested):
Step 1: You need to create the algorithms. The Dataset could look like this:
class Dataset(object):
    def __init__(self, dataset):
        self.dataset = dataset

    def __iter__(self):
        for x in self.dataset:
            yield x
Notice that you make an iterator out of it, so you iterate over it one item at a time. There's a reason for that, you'll see later:
Another algorithm could look like this:
class Multiplier(object):
    def __init__(self, previous, multiplier):
        self.previous = previous
        self.multiplier = multiplier

    def __iter__(self):
        for x in self.previous:
            yield x * self.multiplier
Step 2
Your user would then need to make a chain of these somehow. Now, if they had access to Python directly, you could just do this:
dataset = Dataset(range(100))
multiplier = Multiplier(dataset, 5)
and then get the results by:
for x in multiplier:
    print x
And it would ask the multiplier for one piece of data at a time, and the multiplier would in turn ask the dataset. If you have a chain, this means that one piece of data is handled at a time. So you can handle huge amounts of data without using a lot of memory.
Step 3
Probably you want to specify the steps in some other way. For example a text file or a string (sounds like this may be web-based?). Then you need a registry over the algorithms. The easiest way is to just create a module called "registry.py" like this:
algorithms = {}
Easy, eh? You would register a new algorithm like so:
from registry import algorithms
algorithms['dataset'] = Dataset
algorithms['multiplier'] = Multiplier
You'd also need a method that creates the chain from specifications in a text file or something. I'll leave that up to you. ;)
(I would probably use the Zope Component Architecture and make algorithms components and register them in the component registry. But that is all strictly speaking overkill).
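For illustration only (the answer deliberately leaves the chain-building step to the reader), here is a minimal sketch of how a chain could be built from a list of specifications using the registry module above, assuming Dataset and Multiplier have been registered as shown; the spec format is made up:

from registry import algorithms

def build_chain(spec):
    # spec is a list like [('dataset', (range(100),)), ('multiplier', (5,))];
    # each step wraps the previous one, so iteration stays lazy
    previous = None
    for name, args in spec:
        cls = algorithms[name]
        previous = cls(*args) if previous is None else cls(previous, *args)
    return previous

chain = build_chain([('dataset', (range(100),)), ('multiplier', (5,))])
for x in chain:
    print(x)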
