How to handle a huge collection of functions in Python 2.7

So I am working on an existing code base which has about 150 functions, each returning a time series.
I wanted to store them in a class in order to prevent namespace pollution.
This is what exists:
import some.module

def func1(start_date, end_date):
    # some code here
    return time_series

def func2(start_date, end_date):
    # some code here
    return time_series

# ... and so on, through ...

def func150(start_date, end_date):
    # some code here
    return time_series
Each of these functions has a unique name without any pattern. I tried to put them in a class:
def function_builder(some_data):
    def f(start_date, end_date):
        some_code_here()
        return series
    return f

class TimeSeries():
    func1 = function_builder(some_data)
    func2 = function_builder(some_other_data)
    # ...
    func150 = function_builder(some_other_other_data)
My hope was that I could then simply import the class and use it like:
from some.location import TimeSeries as ts

# actual code use
data = ts.func1(start_date, end_date)

But this approach throws the following error:
TypeError: unbound method f() must be called with TimeSeries instance as first argument (got date instead)
Please advise on how I should proceed with this huge collection of functions. I am new to programming and I want to do this correctly.

You're probably better off creating a submodule rather than a class with multiple functions. However, if you really want to do it the way you described, you need to use static methods instead of methods:
class TimeSeries():
    func1 = staticmethod(function_builder(some_data))
    func2 = staticmethod(function_builder(some_other_data))
    # ...
Alternatively, because you already have function_builder, you can apply staticmethod inside it:
def function_builder(some_data):
    def f(start_date, end_date):
        some_code_here()
        return series
    return staticmethod(f)

class TimeSeries():
    func1 = function_builder(some_data)
    func2 = function_builder(some_other_data)
    # ...
The staticmethod builtin takes a function and returns a static-method version of it, which is also why it can be used as a function decorator.
You can (should?) programmatically generate your time series functions if the inputs to function_builder can themselves be generated algorithmically. You can use setattr or update __dict__ to add the functions to a submodule (or to an object in this module, but that's less elegant, IMHO); a sketch of this idea follows.
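For illustration, here is a minimal sketch of that idea, assuming a hypothetical builder_inputs dict mapping each function name to the data its builder needs (the names and data below are made up, and the function body is a placeholder):
import sys

def function_builder(some_data):
    def f(start_date, end_date):
        # placeholder body; real code would compute the series from some_data
        return (some_data, start_date, end_date)
    return f

# hypothetical mapping: function name -> input for function_builder
builder_inputs = {'func1': 'some_data', 'func2': 'some_other_data'}

_this_module = sys.modules[__name__]
for _name, _data in builder_inputs.items():
    setattr(_this_module, _name, function_builder(_data))

After this loop runs at import time, from yourmodule import func1 works as usual.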

I think what you really should do is separate your functions out into separate modules if you are trying to prevent namespace pollution. However, you could just use a SimpleNamespace (Python 3.3+; for 2.7, the equivalent class at the end of this answer works just as well):
In [1]: def func1(a, b):
   ...:     return a + b
   ...: def func2(a, b, c):
   ...:     return a*b*c
   ...: def func3(x):
   ...:     return 2**x
   ...:
In [2]: from types import SimpleNamespace
In [3]: group1 = SimpleNamespace(func1=func1, func2=func2, func3=func3)
And now you've conveniently organized your namespaces:
In [7]: group1.func1(1,2)
Out[7]: 3
In [8]: group1.func2(1, 2, 3)
Out[8]: 6
In [9]: group1.func3(8)
Out[9]: 256
They will still live under the module's namespace if you do a plain import yourmodule, though. Note that SimpleNamespace is essentially just a class, equivalent to the following:
class SimpleNamespace:
    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)
    def __repr__(self):
        keys = sorted(self.__dict__)
        items = ("{}={!r}".format(k, self.__dict__[k]) for k in keys)
        return "{}({})".format(type(self).__name__, ", ".join(items))
    def __eq__(self, other):
        return self.__dict__ == other.__dict__
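For completeness, since the question targets Python 2.7 (where types.SimpleNamespace doesn't exist), the hand-rolled class above works there unchanged; a quick sketch, with an illustrative function name:
def add(a, b):
    return a + b

group1 = SimpleNamespace(func1=add)
print(group1.func1(1, 2))  # 3
print(group1)              # SimpleNamespace(func1=<function add at ...>)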

Related

PySpark applyInPandas/grouped_map pandas_udf too many arguments

I'm trying to use the pyspark applyInPandas in my python code. The problem is, the function that I want to pass to it lives in the same class, and so it is defined as def func(self, key, df). This becomes an issue because applyInPandas will error out saying I'm passing too many arguments to the underlying func (it allows at most key and df parameters, so the self is causing the issue). Is there any way around this?
The underlying goal is to process a pandas function on dataframe groups in parallel.
As OP mentioned, one way is to just use @staticmethod, which may not be desirable in some cases.
The pyspark source code for creating pandas_udf uses inspect.getfullargspec().args (lines 386, 436); this includes self even when the method is called from an instance. I would think this is a bug on their part (maybe worth raising a ticket).
To overcome this, the easiest way is to use functools.partial which can help change the argspec, i.e. remove the self argument and restore the number of args to 2.
This is based on the idea that calling an instance method is the same as calling the method directly on the class and supplying the instance as the first argument (because of the descriptor magic):
A.func(A(), *args, **kwargs) == A().func(*args, **kwargs)
In a concrete example:
import functools
import inspect

class A:
    def __init__(self, y):
        self.y = y

    def sum(self, a: int, b: int):
        return (a + b) * self.y

    def x(self):
        # call the method on the class, supplying the instance as the first argument
        f = functools.partial(A.sum, self)
        print(f(1, 2))
        print(inspect.getfullargspec(f).args)

A(2).x()
This will print
6 # can still use 'self.y'
['a', 'b'] # 2 arguments (without 'self')
Then, in OP's case, one can simply do the same for key, df parameters:
class A:
    def __init__(self):
        ...

    def func(self, key, df):
        ...

    def x(self):
        f = functools.partial(A.func, self)
        self.df.groupby(...).applyInPandas(f, schema=...)
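If you'd rather avoid functools, a plain closure achieves the same two-argument argspec; a sketch under the same assumptions as the snippet above (grouping columns and schema elided, as in the question):
def x(self):
    def f(key, df):
        # closes over self, so the argspec is just ['key', 'df']
        return self.func(key, df)
    self.df.groupby(...).applyInPandas(f, schema=...)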

Can I define how accessing a Python object (and not just its attributes) is handled?

I have a custom class in Python that I would like to behave in a certain way when the object itself (i.e., not one of its methods/properties) is accessed.
This is a contrived minimal working example to show what I mean. I have a class that holds various pandas DataFrames so that they can separately be manipulated:
import pandas as pd
import numpy as np

class SplitDataFrame:
    def __init__(self, df0, df1):
        self._dfs = [df0, df1]

    def increase(self, num, inc):
        self._dfs[num] = self._dfs[num] + inc

    @property
    def asonedf(self):
        return pd.concat(self._dfs, axis=1)

d = SplitDataFrame(pd.DataFrame(np.random.rand(2,2), columns=['a','b']),
                   pd.DataFrame(np.random.rand(2,2), columns=['q','r']))
d.increase(0, 10)
This works, and I can confirm that d._dfs now indeed is
[          a          b
 0  10.845681  10.561956
 1  10.036739  10.262282,
           q          r
 0   0.164336   0.412171
 1   0.440800   0.945003]
So far, so good.
Now, I would like to change/add to the class's definition so that, when not using the .increase method, it returns the concatenated dataframe. In other words, when accessing d, I would like it to return the same dataframe as when typing d.asonedf, i.e.,
           a          b         q         r
0  10.143904  10.154455  0.776952  0.247526
1  10.039038  10.619113  0.443737  0.040389
That way, the object more closely follows the pandas.DataFrame API:
instead of needing to use d.asonedf['a'], I could access d['a'];
instead of needing to use d.asonedf + 12, I could do d + 12;
etc.
Is that possible?
I could make SplitDataFrame inherit from pandas.DataFrame, but that does not magically add the desired behaviour.
Many thanks!
You could of course proxy all relevant magic methods to a concatenated dataframe built on demand. If you don't want to repeat yourself endlessly, you could do that dynamically.
I'm not saying this is the way to go, but it kind of works:
import textwrap

import pandas as pd
import numpy as np

class SplitDataFrame:
    def __init__(self, df0, df1):
        self._dfs = [df0, df1]

    def increase(self, num, inc):
        self._dfs[num] = self._dfs[num] + inc

    for name in dir(pd.DataFrame):
        if name in (
            "__init__",
            "__new__",
            "__getattribute__",
            "__getattr__",
            "__setattr__",
        ) or not callable(getattr(pd.DataFrame, name)):
            continue
        exec(
            textwrap.dedent(
                f"""
                def {name}(self, *args, **kwargs):
                    return pd.concat(self._dfs, axis=1).{name}(*args, **kwargs)
                """
            )
        )
As you might guess, there are all kinds of strings attached to this solution, and it uses horrible practices (exec, dir, ...).
At the very least I would implement __repr__ so you don't lie to yourself about what kind of object this is, and maybe you'd want to explicitly enumerate all the methods you want defined instead of getting them via dir(). Instead of exec() you can also define the function normally and then set it on the class with setattr, as sketched below.
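For example, a minimal sketch of that exec-free variant, reusing the SplitDataFrame class from the question and forwarding only an explicitly enumerated set of methods (the list of names here is illustrative, not exhaustive):
import pandas as pd

def _make_proxy(name):
    def proxy(self, *args, **kwargs):
        # build the concatenated frame on demand and delegate to it
        return getattr(pd.concat(self._dfs, axis=1), name)(*args, **kwargs)
    return proxy

# explicitly enumerated dunder methods to forward; extend as needed
for _name in ("__getitem__", "__add__"):
    setattr(SplitDataFrame, _name, _make_proxy(_name))

With this, d['a'] and d + 12 both delegate to the concatenated dataframe, and you can still write an honest __repr__ by hand.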

Can I force an expression to be treated as a constant by numba?

Given some global non-mutated object of some type not known to numba:
from types import SimpleNamespace
a = SimpleNamespace(b=2)
I'd like to be able to reference a member of this object as a compile-time constant within a jitted function, something like this:
@numba.njit
def foo():
    # return a.b  # fails, because numba tries to evaluate it at runtime
    return numba.mark_this_as_constant(a.b)
Does mark_this_as_constant exist in numba under a different name already? Is it possible to write this myself, perhaps with a custom type?
I can get what I want today with:
def foo(a_b=a.b):
    @numba.njit
    def foo():
        return a_b
    return foo

foo = foo()
but this is pretty gross, and requires me to list every closure at the top, rather than at the point of use.
Have you tried hoisting the attribute into a module-level global? Numba reads globals at compile time and freezes their values as constants, so something like this should work:
a = SimpleNamespace(b=2)
a_b = a.b

@numba.njit
def foo():
    return a_b
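A quick way to convince yourself the value really is baked in at compile time (relying on numba's usual behavior of freezing globals when the function is first compiled):
import numba
from types import SimpleNamespace

a = SimpleNamespace(b=2)
a_b = a.b

@numba.njit
def foo():
    return a_b

print(foo())  # 2 -- compiled on this first call, with a_b frozen as a constant
a_b = 100     # rebinding the global afterwards has no effect
print(foo())  # still 2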

Custom Indexing Python Data Structure

I have a class that wraps around the Python deque from collections. When I create a plain deque, x = deque(), I can reference the first element:
In[78]: x[0]
Out[78]: 0
My question is: how can I use [] for indexing with the following wrapper?
class deque_wrapper:
    def __init__(self):
        self.data_structure = deque()

    def newCustomAddon(x):
        return len(self.data_structure)

    def __repr__(self):
        return repr(self.data_structure)
I.e., continuing from the above example:
In [75]: x[0]
TypeError: 'deque_wrapper' object does not support indexing
I want to implement my own indexing; is that possible?
You want to implement the __getitem__ method:
class DequeWrapper:
    def __init__(self):
        self.data_structure = deque()

    def newCustomAddon(self):
        return len(self.data_structure)

    def __repr__(self):
        return repr(self.data_structure)

    def __getitem__(self, index):
        return self.data_structure[index]
Whenever you do my_obj[x], Python will actually call my_obj.__getitem__(x).
You may also want to consider implementing the __setitem__ method, if applicable. (When you write my_obj[x] = y, Python will actually run my_obj.__setitem__(x, y).)
The Python data model documentation contains more information on which methods you need to implement to build custom data structures in Python.
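Putting it together, a minimal runnable sketch (the append method is an assumed addition so the example has something to index):
from collections import deque

class DequeWrapper:
    def __init__(self):
        self.data_structure = deque()

    def append(self, value):
        # assumed helper, not part of the original wrapper
        self.data_structure.append(value)

    def __getitem__(self, index):
        return self.data_structure[index]

    def __setitem__(self, index, value):
        self.data_structure[index] = value

    def __repr__(self):
        return repr(self.data_structure)

x = DequeWrapper()
x.append(0)
print(x[0])   # 0
x[0] = 42
print(x[0])   # 42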

How to keep help strings the same when applying decorators?

How can I keep a function's help string visible after applying a decorator?
Right now the doc string is (partially) replaced with that of the inner function of the decorator.
def deco(fn):
    def x(*args, **kwargs):
        return fn(*args, **kwargs)
    x.func_doc = fn.func_doc
    x.func_name = fn.func_name
    return x

@deco
def y(a, b):
    """This is Y"""
    pass

def z(c, d):
    """This is Z"""
    pass

help(y)  # 1
help(z)  # 2
For y, the required arguments aren't shown in the help, so a user may assume it takes arbitrary arguments when it actually doesn't:
y(*args, **kwargs)    # <-- y(a, b) is desired
    This is Y
z(c, d)
    This is Z
I use help() and dir() a lot, since they're faster than PDF manuals, and I want to produce reliable docstrings for my library and tools, but this is an obstacle.
Give the decorator module a peek; I believe it does exactly what you want.
In [1]: from decorator import decorator

In [2]: @decorator
   ...: def say_hello(f, *args, **kwargs):
   ...:     print "Hello!"
   ...:     return f(*args, **kwargs)
   ...:

In [3]: @say_hello
   ...: def double(x):
   ...:     return 2*x
   ...:
and the help info for double shows "double(x)" in it.
What you're requesting is very hard to do "properly", because help gets the function signature from inspect.getargspec, which in turn relies on introspection that cannot directly be fooled. Doing it "properly" would mean generating a new function object on the fly (instead of a simple wrapper function) with the right argument names, counts, and default values: extremely hard, advanced, black-magic bytecode hacking, in other words.
I think it may be easier to do it by monkeypatching (never a pleasant prospect, but sometimes the only way to perform customization tasks that would otherwise be almost impossible, like this one): replace the real inspect.getargspec with your own lookalike function that uses a look-aside table, mapping the wrapper functions you generate to the wrapped functions' argspecs, and otherwise delegating to the real thing.
import functools
import inspect

realgas = inspect.getargspec
lookaside = dict()

def fakegas(f):
    if f in lookaside:
        return lookaside[f]
    return realgas(f)

inspect.getargspec = fakegas

def deco(fn):
    @functools.wraps(fn)
    def x(*args, **kwargs):
        return fn(*args, **kwargs)
    lookaside[x] = realgas(fn)
    return x

@deco
def x(a, b=23):
    """Some doc for x."""
    return a + b

help(x)
This prints, as required:
Help on function x in module __main__:

x(a, b=23)
    Some doc for x.
(END)
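For what it's worth, on Python 3.4+ plain functools.wraps is usually enough: it sets __wrapped__ on the wrapper, and inspect.signature (which help() uses) follows that attribute back to the original function, so the real signature shows up without any monkeypatching. A sketch:
import functools

def deco(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        return fn(*args, **kwargs)
    return wrapper

@deco
def y(a, b):
    """This is Y"""

help(y)  # shows y(a, b) and "This is Y"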

Categories

Resources