Does Pandas allow custom objects as column labels?

In Pandas, I've been using custom objects as column labels because they can carry rich, column-specific information and methods. For example, you can give each column a custom fmt_fn to format its values (note this is just an example; my actual column label objects are more complex):
In [99]: import numpy as np; import pandas as pd; from datetime import timedelta
In [100]: class Col:
     ...:     def __init__(self, name, fmt_fn):
     ...:         self.name = name
     ...:         self.fmt_fn = fmt_fn
     ...:     def __str__(self):
     ...:         return self.name
     ...:
In [101]: sec_col = Col('time', lambda val: str(timedelta(seconds=val)).split('.')[0])
In [102]: dollar_col = Col('money', lambda val: '${:.2f}'.format(val))
In [103]: foo = pd.DataFrame(np.random.random((3, 2)) * 1000, columns=[sec_col, dollar_col])
In [104]: print(foo)  # ugly
         time       money
0  773.181402  720.997051
1   33.779925  317.957813
2  590.750129  416.293245
In [105]: print(foo.to_string(formatters=[col.fmt_fn for col in foo.columns]))  # pretty
      time    money
0  0:12:53  $721.00
1  0:00:33  $317.96
2  0:09:50  $416.29
Okay, so I've been happily doing this for a while, but then I recently came across one part of Pandas that doesn't support this. Specifically, methods to_hdf/read_hdf will fail on DataFrames with custom column labels. This is not a dealbreaker for me. I can use pickle instead of HDF5 at the loss of some efficiency.
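For reference, a quick sketch of that pickle round-trip (the filename is arbitrary), which does preserve the custom label objects:
In [106]: foo.to_pickle('foo.pkl')
In [107]: pd.read_pickle('foo.pkl').columns[0].fmt_fn  # the Col object survives intact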
But the bigger question is, does Pandas in general support custom objects as column labels? In other words, should I continue to use Pandas this way, or will this break in other parts of Pandas (besides HDF5) in the future, causing me future pain?
PS. As a side note: if you're not using custom objects as column labels, I'd also welcome hearing how you handle column-specific info such as the fmt_fn in the example above.

Fine-grained control of formatting of a DataFrame isn't really possible right now. E.g., see here or here for some discussion of possibilities. I'm sure a well thought out API (and PR!) would be well received.
In terms of using custom objects as columns, the two biggest issues are probably serialization, and indexing semantics (e.g. can no longer do df['time']).
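To make the indexing point concrete, a quick sketch using sec_col from the question:
foo[sec_col]   # works: the column Index hash table is keyed by the label objects
foo['time']    # KeyError: Col has the default identity-based __eq__/__hash__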
One possible work-around would be to wrap your DataFrame in some kind of pretty-print structure, like this:
In [174]: class PrettyDF(object):
     ...:     def __init__(self, data, formatters):
     ...:         self.data = data
     ...:         self.formatters = formatters
     ...:     def __str__(self):
     ...:         return self.data.to_string(formatters=self.formatters)
     ...:     def __repr__(self):
     ...:         return self.__str__()
In [172]: foo = PrettyDF(df,
     ...:                formatters={'money': '${:.2f}'.format,
     ...:                            'time': lambda val: str(timedelta(seconds=val)).split('.')[0]})
In [178]: foo
Out[178]:
      time    money
0  0:13:17  $399.29
1  0:08:48  $122.44
2  0:07:42  $491.72
In [180]: foo.data['time']
Out[180]:
0    797.699511
1    528.155876
2    462.999224
Name: time, dtype: float64
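If the explicit .data hop bothers you, one possible extension (my addition, not part of the original answer) is to forward unknown attribute and item access from PrettyDF to the wrapped frame:
def __getattr__(self, name):
    # only called when PrettyDF itself lacks the attribute
    return getattr(self.data, name)

def __getitem__(self, key):
    return self.data[key]
With those two methods added to the class, foo['time'] and foo.mean() pass straight through to the underlying DataFrame.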

It's been five years since this was posted, so I hope this is still helpful to someone. I've managed to build an object that holds metadata for a pandas DataFrame column but is still accessible as a regular column (or so it seems to me). The code below is just the part of the whole class that's involved.
__repr__ presents the object's name if the DataFrame is printed instead of the object.
__eq__ checks the requested name against the object's name; __hash__ is also used in this process. Column names need to be hashable, since lookup works much like a dictionary's.
That's probably not the Pythonic way of describing it, but that seems to be the way it works.
class ColumnDescriptor:
    def __init__(self, name, **kwargs):
        self.name = name
        for n, v in kwargs.items():
            setattr(self, n, v)

    def __repr__(self): return self.name
    def __str__(self): return self.name
    def __eq__(self, other): return self.name == other
    def __hash__(self): return hash(self.name)
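A quick usage sketch (the unit attribute here is a hypothetical piece of metadata):
import pandas as pd

time_col = ColumnDescriptor('time', unit='s')
df = pd.DataFrame({time_col: [1.0, 2.0]})
df['time']          # works: __eq__/__hash__ defer to .name
df.columns[0].unit  # 's' -- the metadata rides along on the label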

Related

How do you associate metadata or annotations with a Python function or method?

I am looking to build fairly detailed annotations for methods in a Python class. These are to be used in troubleshooting, documentation, tooltips for a user interface, etc. However, it's not clear how I can keep these annotations associated with the functions.
For context, this is a feature engineering class, so two example methods might be:
def create_feature_momentum(self):
    return self.data['mass'] * self.data['velocity']

def create_feature_kinetic_energy(self):
    return 0.5 * self.data['mass'] * self.data['velocity'].pow(2)
For example:
It'd be good to tell easily what core features were used in each engineered feature.
It'd be good to track arbitrary metadata about each method
It'd be good to embed non-string data as metadata about each function. Eg. some example calculations on sample dataframes.
So far I've been manually creating docstrings like:
def create_feature_kinetic_energy(self) -> pd.Series:
    '''Calculate the non-relativistic kinetic energy.
    Depends on: ['mass', 'velocity']
    Supports NaN Values: False
    Unit: Energy (J)
    Example:
        self.data = pd.DataFrame({'mass': [0, 1, 2], 'velocity': [0, 1, 2]})
        self.create_feature_kinetic_energy()
        >>> pd.Series([0, 0.5, 4])
    '''
    return 0.5 * self.data['mass'] * self.data['velocity'].pow(2)
And then I'm using regex to get the data about a function by inspecting the __doc__ attribute. However, is there a better place than __doc__ where I could store information about a function? In the example above, it's fairly easy to parse the Depends on list, but in my use case it'd be good to also embed some example data as dataframes somehow (and I think writing them as markdown in the docstring would be hard).
Any ideas?
I ended up writing a class as follows:
class ScubaDiver(pd.DataFrame):
    accessed = None

    def __getitem__(self, key):
        if self.accessed is None:
            self.accessed = set()
        self.accessed.add(key)
        return pd.Series(dtype=float)

    @property
    def columns(self):
        return list(self.accessed)
The way my code is written, I can do this:
sd = ScubaDiver()
foo(sd)
sd.columns
and sd.columns contains all the columns accessed by foo. This might not work in your codebase, though.
I also wrote this decorator:
def add_note(notes: dict):
    '''Adds k:v pairs to a .notes attribute on the decorated function.'''
    def _(f):
        if not hasattr(f, 'notes'):
            f.notes = {}
        f.notes |= notes  # in-place dict merge (requires Python 3.9+)
        return f
    return _
You can use it as follows:
@add_note({'Units': 'J', 'Relativity': False})
def create_feature_kinetic_energy(self):
    return 0.5 * self.data['mass'] * self.data['velocity'].pow(2)
and then you can do:
create_feature_kinetic_energy.notes['Units'] # J
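Building on that, a small helper (the name collect_notes is mine) can gather every annotated method's metadata in one place, e.g. to generate documentation:
import inspect

def collect_notes(cls):
    '''Map each method name to its .notes dict, skipping unannotated ones.'''
    return {name: fn.notes
            for name, fn in inspect.getmembers(cls, callable)
            if hasattr(fn, 'notes')}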

Can I define how accessing a Python object (and not just its attributes) is handled?

I have a custom class in Python that I would like to behave in a certain way if the object itself (i.e., not one of its methods/properties) is accessed.
This is a contrived minimal working example to show what I mean. I have a class that holds various pandas DataFrames so that they can separately be manipulated:
import pandas as pd
import numpy as np

class SplitDataFrame:
    def __init__(self, df0, df1):
        self._dfs = [df0, df1]

    def increase(self, num, inc):
        self._dfs[num] = self._dfs[num] + inc

    @property
    def asonedf(self):
        return pd.concat(self._dfs, axis=1)

d = SplitDataFrame(pd.DataFrame(np.random.rand(2,2), columns=['a','b']),
                   pd.DataFrame(np.random.rand(2,2), columns=['q','r']))
d.increase(0, 10)
This works, and I can confirm that d._dfs now indeed is
[          a          b
 0  10.845681  10.561956
 1  10.036739  10.262282,
           q         r
 0  0.164336  0.412171
 1  0.440800  0.945003]
So far, so good.
Now, I would like to change/add to the class's definition so that, when not using the .increase method, it returns the concatenated dataframe. In other words, when accessing d, I would like it to return the same dataframe as when typing d.asonedf, i.e.,
           a          b         q         r
0  10.143904  10.154455  0.776952  0.247526
1  10.039038  10.619113  0.443737  0.040389
That way, the object more closely follows the pandas.DataFrame api:
instead of needing to use d.asonedf['a'], I could access d['a'];
instead of needing to use d.asonedf + 12, I could do d + 12;
etc.
Is that possible?
I could make SplitDataFrame inherit from pandas.DataFrame, but that does not magically add the desired behaviour.
Many thanks!
You could of course proxy all the relevant magic methods to a concatenated dataframe built on demand. If you don't want to repeat yourself endlessly, you can do that dynamically.
I'm not saying this is the way to go, but it kind of works:
import textwrap
import pandas as pd
import numpy as np

class SplitDataFrame:
    def __init__(self, df0, df1):
        self._dfs = [df0, df1]

    def increase(self, num, inc):
        self._dfs[num] = self._dfs[num] + inc

    # runs in the class body: define a forwarding method for every
    # callable attribute of pd.DataFrame except the excluded dunders
    for name in dir(pd.DataFrame):
        if name in (
            "__init__",
            "__new__",
            "__getattribute__",
            "__getattr__",
            "__setattr__",
        ) or not callable(getattr(pd.DataFrame, name)):
            continue
        exec(
            textwrap.dedent(
                f"""
                def {name}(self, *args, **kwargs):
                    return pd.concat(self._dfs, axis=1).{name}(*args, **kwargs)
                """
            )
        )
As you might guess, there are all kinds of strings attached to this solution, and it uses horrible practices (exec, dir, ...).
At the very least I would implement __repr__ so you don't lie to yourself about what kind of object this is, and maybe you'd want to explicitly enumerate all methods you want defined instead of getting them via dir(). Instead of exec() you can define the function normally and then set it on the class with setattr.
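A rough sketch of that tidier variant (the method list here is illustrative, not exhaustive):
def _make_proxy(name):
    def proxy(self, *args, **kwargs):
        # build the concatenated frame on demand and forward the call
        return getattr(pd.concat(self._dfs, axis=1), name)(*args, **kwargs)
    return proxy

for _name in ('__getitem__', '__add__', '__len__', '__repr__', 'sum', 'mean'):
    setattr(SplitDataFrame, _name, _make_proxy(_name))
This works even for dunders, because Python looks special methods up on the class, not the instance.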

Pandas style object members

I'm trying to figure out how Pandas manages to create new object members on the fly. For example, if you do this:
d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d)
You can immediately do this:
df.col1
and get the contents of col1. How does Pandas create the col1 member on the fly?
Thanks.
Relevant code in the repository that checks for a dictionary input:
class DataFrame(NDFrame):
    def __init__(self, data=None, index=None, columns=None, dtype=None,
                 copy=False):
        if data is None:
            data = {}
        # Some other if-statements that check data types...
        elif isinstance(data, dict):
            mgr = self._init_dict(data, index, columns, dtype=dtype)
which uses the _init_dict method:
def _init_dict(self, data, index, columns, dtype=None):
    if columns is not None:
        # Does some stuff - but this isn't your case
        ...
    else:
        keys = list(data.keys())
        if not isinstance(data, OrderedDict):
            # This part sorts the keys into alphabetical column order.
            # The _try_sort function is simple; it lives in pandas.core.common.
            keys = _try_sort(keys)
        columns = data_names = Index(keys)
So the real work comes from the Index class in pandas.core.indexes.base. From there things get really complicated (my understanding of what it means to explain the "how" of anything, without regressing all the way down to machine code, started to melt away), but it's safe to say that if you give the pandas.Index class a 1-dimensional array of data, it will create an object that is sliceable and has an associated data type.
Which is exactly what you're observing - you essentially fed it a bunch of keys and pandas understood that it needed to give you something back that you could access as an index (since df.col1 is just syntactic sugar for df['col1']), that you could slice (df[0:1]), and that knew its own data types.
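Both properties are easy to see on a hand-built Index:
In [10]: idx = pd.Index(['col1', 'col2'])
In [11]: idx[0:1]
Out[11]: Index(['col1'], dtype='object')
In [12]: idx.dtype
Out[12]: dtype('O')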
And, of course, after asking the question, I found the answer myself.
It turns out you can use __getattr__ to achieve this. The easiest way (and the one I happen to want) is to use a dictionary, then have __getattr__ return values from that dictionary, like so:
class test():
    def __init__(self):
        # super().__init__()
        self.adict = {'spam': 'eggs'}

    def __getattr__(self, attr):
        return self.adict[attr]

nt = test()
print(nt.spam)
__getattr__ is called when an attribute isn't found through normal lookup, as is the case here. The interpreter can't find the spam attribute, so it defers to __getattr__. Things to keep in mind:
If the key doesn't exist in the dictionary, this will raise a KeyError, not an AttributeError.
Don't use __getattribute__, because it is called on every attribute access, which will mess up your entire class.
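To address the first caveat, you can translate the KeyError into the AttributeError that callers (and tools such as hasattr, or getattr with a default) expect; a minimal sketch:
class test():
    def __init__(self):
        self.adict = {'spam': 'eggs'}

    def __getattr__(self, attr):
        try:
            return self.adict[attr]
        except KeyError:
            raise AttributeError(attr)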
Thanks to everyone for the input on this.

How to handle a huge collection of functions in Python 2.7

So I am working on an existing code base which has about 150 functions that each return a time series.
I wanted to store them in a class in order to prevent namespace pollution.
This is what exists
import some.module

def func1(start_date, end_date):
    # some code here
    return time_series

def func2(start_date, end_date):
    # some code here
    return time_series

# ...

def func150(start_date, end_date):
    # some code here
    return time_series
Each of these functions has a unique name without any pattern. I tried to put them in a class:
def function_builder(some_data):
    def f(start_date, end_date):
        some_code_here()
        return series
    return f

class TimeSeries():
    func1 = function_builder(some_data)
    func2 = function_builder(some_other_data)
    # ...
    func150 = function_builder(some_other_other_data)
My hope was that I could then simply import the class and use it like:
from some.location import TimeSeries as ts

# Actual code use
data = ts.func1(start_date, end_date)
But this approach throws the following error
TypeError: unbound method f() must be called with TimeSeries instance as first argument (got date instead)
Please advise on how I should proceed with this huge collection of functions. I am new to programming and want to do this correctly.
You're probably better off creating a submodule rather than a class with multiple functions. However, if you really want to do it the way you described, you need to use static methods instead of methods:
class TimeSeries():
    func1 = staticmethod(function_builder(some_data))
    func2 = staticmethod(function_builder(some_other_data))
    # ...
Alternatively, because you already have function_builder, you can bake the staticmethod wrapping into it:
def function_builder(some_data):
    def f(start_date, end_date):
        some_code_here()
        return series
    return staticmethod(f)

class TimeSeries():
    func1 = function_builder(some_data)
    func2 = function_builder(some_other_data)
    # ...
The staticmethod function takes a function and returns a static method-y version of it. Thus, it can also be used as a function decorator.
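For example, applied directly as a decorator to a hand-written method (compute_series here is a hypothetical stand-in for the real logic):
class TimeSeries():
    @staticmethod
    def special_case(start_date, end_date):
        return compute_series(start_date, end_date)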
You can (should?) programmatically generate your time series functions if the inputs to function_builder can themselves be generated algorithmically. You can use setattr (or update __dict__) to add the functions to a submodule, or to an object in this module, though that's less elegant, IMHO; see the sketch below.
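A hedged sketch of that idea, applied to the TimeSeries class from above (SERIES_SPECS is a hypothetical mapping from function name to its input data):
SERIES_SPECS = {
    'func1': some_data,
    'func2': some_other_data,
    # ... one entry per series
}

for _name, _data in SERIES_SPECS.items():
    setattr(TimeSeries, _name, staticmethod(function_builder(_data)))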
I think what you really should do is separate your functions out into separate modules if you are trying to prevent name-space pollution. However, you could just use a SimpleNamespace:
In [1]: def func1(a, b):
   ...:     return a + b
   ...: def func2(a, b, c):
   ...:     return a*b*c
   ...: def func3(x):
   ...:     return 2**x
   ...:
In [2]: from types import SimpleNamespace
In [3]: group1 = SimpleNamespace(func1=func1, func2=func2, func3=func3)
And now you've conveniently organized your namespaces:
In [7]: group1.func1(1,2)
Out[7]: 3
In [8]: group1.func2(1, 2, 3)
Out[8]: 6
In [9]: group1.func3(8)
Out[9]: 256
They will still be under the module's namespace if you do a simple import yourmodule, though. Note that SimpleNamespace is essentially just a class, equivalent to the following:
class SimpleNamespace:
    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)

    def __repr__(self):
        keys = sorted(self.__dict__)
        items = ("{}={!r}".format(k, self.__dict__[k]) for k in keys)
        return "{}({})".format(type(self).__name__, ", ".join(items))

    def __eq__(self, other):
        return self.__dict__ == other.__dict__

Class created with a different number of class attributes for every instance?

I know that sounds generic, so I will try my best to explain the case.
I am getting some data back from a process. I would like to organize this data in a format that I can interact with and access.
So far I've thought that I could make a simple, empty class which at creation time takes **kwargs and cycles through them, adding them to the class.
Although I am not sure if this is the correct way to do so. Imagine the following data:
dict1 = {'param1': a, 'param2': b, 'param3': c, 'param4': d}  # Operation1
dict2 = {'param2': t, 'param1': 2, 'param3': r}  # Operation2
dict3 = {'param1': 1, 'param7': 2, 'param2': 4, 'param4': b, 'param3': m}  # Operation3
I would like to make a class that, at creation time, takes the parameters and turns each one into an attribute, named after the parameter and holding its value:
myclass1(dict1)
myclass1.param1 returns a, myclass1.param2 returns b, and so on.
But if I want to make my class using dict2, I can also do so:
myclass2(dict2)
myclass2.param2 returns t, myclass2.param1 returns 2, and so on.
This way each parameter can be retrieved by name later, and I do not have to worry about how many elements my data has: the class always takes however many parameters there are and creates attributes named after the dictionary keys.
Is it possible to achieve this in a simple way in Python? I could use a dictionary of dictionaries, but that feels utterly complicated for big data sets.
You can do something like:
In [2]: dict1 = {'param1': 'a', 'param2': 'b', 'param3': 'c', 'param4': 'd'}
In [3]: class A(object):
   ...:     def __init__(self, params):
   ...:         for k, v in params.items():
   ...:             setattr(self, k, v)
   ...:
In [4]: a = A(dict1)
In [5]: a.param1
Out[5]: 'a'
In [6]: a.param2
Out[6]: 'b'
