Why does subclassing a DataFrame mutate the original object? - python

I am ignoring the warnings and trying to subclass a pandas DataFrame. My reasons for doing so are as follows:
I want to retain all the existing methods of DataFrame.
I want to set a few additional attributes at class instantiation, which will later be used to define additional methods that I can call on the subclass.
Here's a snippet:
class SubFrame(pd.DataFrame):
    def __init__(self, *args, **kwargs):
        freq = kwargs.pop('freq', None)
        ddof = kwargs.pop('ddof', None)
        super(SubFrame, self).__init__(*args, **kwargs)
        self.freq = freq
        self.ddof = ddof
        self.index.freq = pd.tseries.frequencies.to_offset(self.freq)
    @property
    def _constructor(self):
        return SubFrame
Here's a use example. Say I have the DataFrame
print(df)
col0 col1 col2
2014-07-31 0.28393 1.84587 -1.37899
2014-08-31 5.71914 2.19755 3.97959
2014-09-30 -3.16015 -7.47063 -1.40869
2014-10-31 5.08850 1.14998 2.43273
2014-11-30 1.89474 -1.08953 2.67830
where the index has no frequency
print(df.index)
DatetimeIndex(['2014-07-31', '2014-08-31', '2014-09-30', '2014-10-31',
'2014-11-30'],
dtype='datetime64[ns]', freq=None)
Using SubFrame allows me to specify that frequency in one step:
sf = SubFrame(df, freq='M')
print(sf.index)
DatetimeIndex(['2014-07-31', '2014-08-31', '2014-09-30', '2014-10-31',
'2014-11-30'],
dtype='datetime64[ns]', freq='M')
The issue is, this modifies df:
print(df.index.freq)
<MonthEnd>
What's going on here, and how can I avoid this?
Moreover, I confess to using copied code that I don't fully understand. What is happening within __init__ above? Is it necessary to use args/kwargs with pop here? (Why can't I just specify the parameters as usual?)

I'll add to the warnings. Not that I want to discourage you, I actually applaud your efforts.
However, this won't be the last of your questions as to what is going on.
That said, once you run:
super(SubFrame, self).__init__(*args, **kwargs)
self is a bona fide dataframe. You created it by passing another dataframe to the constructor.
Try this as an experiment
d1 = pd.DataFrame(1, list('AB'), list('XY'))
d2 = pd.DataFrame(d1)
d2.index.name = 'IDX'
d1
X Y
IDX
A 1 1
B 1 1
So the observed behavior is consistent, in that when you construct one dataframe by passing another dataframe to the constructor, you end up pointing to the same objects.
To answer your question: subclassing isn't what allows the original object to be mutated; it's the way pandas constructs a dataframe from a passed dataframe.
Avoid this by instantiating with a copy:
d2 = pd.DataFrame(d1.copy())
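As a small check of the copy approach, the sketch below builds the second frame from a copy and renames its index. The frame contents and the variable names original and independent are made up for illustration. Whether the un-copied version shares the index object may depend on your pandas version, so only the copied case is shown.

```python
import pandas as pd

# Small frame standing in for the question's df (values are made up).
original = pd.DataFrame({"X": [1, 1], "Y": [1, 1]}, index=list("AB"))

# Build the second frame from a copy, as suggested above.
independent = pd.DataFrame(original.copy())
independent.index.name = "IDX"

print(original.index.name)     # None: the original index was not renamed
print(independent.index.name)  # IDX
```

The same idea applies inside a subclass __init__: copy the incoming frame before touching attributes of its index.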
What's going on in the __init__
You want to pass on all the args and kwargs to pd.DataFrame.__init__ with the exception of the specific kwargs that are intended for your subclass. In this case, freq and ddof. pop is a convenient way to grab the values and delete the key from kwargs before passing it on to pd.DataFrame.__init__
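The pop pattern can be tried without pandas. In this sketch Base is a hypothetical stand-in for pd.DataFrame that rejects unknown keyword arguments, and ChildExplicit shows one possible alternative: declaring the extra options as keyword-only parameters instead of popping them.

```python
class Base:
    """Stand-in for pd.DataFrame: accepts only the arguments it knows about."""
    def __init__(self, data=None, index=None):
        self.data = data
        self.index = index


class Child(Base):
    def __init__(self, *args, **kwargs):
        # Remove the subclass-only options before the parent sees them.
        freq = kwargs.pop("freq", None)
        ddof = kwargs.pop("ddof", None)
        super().__init__(*args, **kwargs)
        self.freq = freq
        self.ddof = ddof


class ChildExplicit(Base):
    # Same effect, spelled with keyword-only parameters instead of pop.
    def __init__(self, *args, freq=None, ddof=None, **kwargs):
        super().__init__(*args, **kwargs)
        self.freq = freq
        self.ddof = ddof


a = Child([1, 2, 3], freq="M")
b = ChildExplicit([1, 2, 3], freq="M", ddof=1)
print(a.data, a.freq, a.ddof)  # [1, 2, 3] M None
print(b.data, b.freq, b.ddof)  # [1, 2, 3] M 1
```

Either way, the point is that the parent constructor never receives freq or ddof, which it would otherwise reject as unexpected keyword arguments.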
How I'd implement this with pipe
def add_freq(df, freq):
    df = df.copy()
    df.index.freq = pd.tseries.frequencies.to_offset(freq)
    return df
df = pd.DataFrame(dict(A=[1, 2]), pd.to_datetime(['2017-03-31', '2017-04-30']))
df.pipe(add_freq, 'M')

Related

Python assign different variables to a class object

This is a general python question. Is it possible to assign different variables to a class object and then perform different set of operations on those variables? I'm trying to reduce code but maybe this isn't how it works. For example, I'm trying to do something like this:
Edit: here is an abstract of the class and methods:
class Class:
    def __init__(self, df):
        self.df = df
    def query(self, query):
        self.df = self.df.query(query)
        return self
    def forward_fill(self, filter):
        self.df.update(self.df.filter(like=filter).mask(lambda x: x == 0).ffill(1))
        return self
    def diff(self, cols=None, axis=1):
        diff = self.df[self.df.columns[~self.df.columns.isin(cols)]].diff(axis=axis)
        self.df = diff.join(self.df[self.df.columns.difference(diff.columns)])
        return self
    def melt(self, cols, var=None, value=None):
        return pd.melt(self.df, id_vars=cols, var_name=var, value_name=value)
I'm trying to use it like this:
df = pd.read_csv('data.csv')
df = Class(df)
df = df.query(query).forward_fill(include)
df_1 = df.diff(cols).melt(cols)
df_2 = df.melt(cols)
df_1 and df_2 should have different values; however, df_2 comes out the same as df_1. This issue is resolved if I use the class like this:
df_1 = pd.read_csv('data.csv')
df_2 = pd.read_csv('data.csv')
df_1 = Class(df_1)
df_2 = Class(df_2)
df_1 = df_1.query(query).forward_fill(include)
df_2 = df_2.query(query).forward_fill(include)
df_1 = df_1.diff(cols).melt(cols)
df_2 = df_2.melt(cols)
This results in extra code. Is there a better way to do this where you can use an object differently on different variables, or do I have to create separate objects if I'm trying to have two variables perform separate operations and return different values?
With the return self statement in the diff method you return a reference to the same object, and the same goes for the other chained methods. By the time melt is called, those methods have already modified the df stored on the instance.
Here:
df = pd.read_csv('data.csv')

df = Class(df)
df = df.query(query).forward_fill(include)

df_1 = df.diff(cols).melt(cols)
df holds the same data that df_1 was built from. I guess the melt method, called with no arguments other than cols, only reshapes the data and assigns column names. Consequently df_2 = df.melt(cols) gives the same result as df_1.
If you want to keep working with one object, you should not use self.df = ... in your class methods, because this changes the df held by the instance. You only need to write df = ... and then return Class(df).
For example:
def diff(self, cols=None, axis=1):
    diff = self.df[self.df.columns[~self.df.columns.isin(cols)]].diff(axis=axis)
    df = diff.join(self.df[self.df.columns.difference(diff.columns)])
    return Class(df)
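Applied to every chained method, the same idea might look like the sketch below. The class name Frame, the sample data, and the simplified diff are assumptions for illustration and not the asker's real code; the point is only that each method returns a new wrapper and never reassigns self.df.

```python
import pandas as pd


class Frame:
    """Hypothetical wrapper; every method returns a new Frame."""

    def __init__(self, df):
        self.df = df

    def query(self, expr):
        return Frame(self.df.query(expr))

    def diff(self, cols, axis=1):
        # Difference the columns not listed in cols, then re-attach cols.
        numeric = self.df.drop(columns=cols).diff(axis=axis)
        return Frame(numeric.join(self.df[cols]))

    def melt(self, cols):
        return pd.melt(self.df, id_vars=cols)


raw = pd.DataFrame({"id": ["a", "b"], "t1": [1, 4], "t2": [3, 9]})
base = Frame(raw).query("t1 > 0")

df_1 = base.diff(["id"]).melt(["id"])  # built from the differenced data
df_2 = base.melt(["id"])               # built from the untouched data
print(df_2["value"].tolist())          # [1, 4, 3, 9]
```

Because base is never modified, one object can feed both pipelines and the CSV only has to be read once.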
Best regards

python/pandas: DataFrame inheritance and DataFrame update when 'inplace' is not possible

I am sorry, I am aware the title is somewhat fuzzy.
Context
I am using a DataFrame to keep track of files because pandas DataFrame features several relevant functions for all kinds of filtering that a dict cannot do, with loc, pd.IndexSlice, .index, .columns, pd.MultiIndex...
OK, so this may not appear to be the best choice to expert developers (which I am not), but all these functions have been so handy that I have come to use a DataFrame for this.
And, as the cherry on the cake, the __repr__ of a MultiIndex DataFrame is just perfect when I want to know what is inside my file list.
Quick introduction to Summary class, inheriting from DataFrame
Because my DataFrame, that I call 'Summary', has some specific functions, I would like to make it a class, inheriting from pandas DataFrame class.
It also has 'fixed' MultiIndexes, for both rows and columns.
Finally, because my Summary class is defined outside the Store class which is actually managing file organization, Summary class needs a function from Store to be able to retrieve file organization.
Questions
The trouble with pd.DataFrame is that (AFAIK) you cannot append rows without creating a new DataFrame.
As Summary has a refresh function so that it can recreate itself by reading the folder content, a refresh effectively resets the Summary object.
To manage the Summary refresh, I came up with a first version (not working) and finally a second one (working).
import pandas as pd
import numpy as np
# Dummy function
def summa(a,b):
    return a+b
# Does not work
class DatF1(pd.DataFrame):
    def __init__(self,meth,data=None):
        cmidx = pd.MultiIndex.from_arrays([['Index', 'Index'],['First', 'Last']])
        rmidx = pd.MultiIndex(levels=[[],[]], codes=[[],[]],
                              names=['Component','Interval'])
        super().__init__(data=data, index=rmidx, columns=cmidx, dtype=np.datetime64)
        self.meth=meth
    def refresh(self):
        values = [[pd.Timestamp('2020/02/10 8:00'),pd.Timestamp('2020/02/10 8:00')],
                  [pd.Timestamp('2020/02/11 8:00'),pd.Timestamp('2020/02/12 8:00')]]
        rmidx = pd.MultiIndex.from_arrays([['Comp1','Comp1'],['1h','1W']],names=['Component','Interval'])
        self = pd.DataFrame(values, index=rmidx, columns=self.columns)
ex1 = DatF1(summa)
In [10]: ex1.meth(3,4)
Out[10]: 7
ex1.refresh()
In [11]: ex1
Out[11]: Empty DatF1
Columns: [(Index, First), (Index, Last)]
Index: []
After refresh(), ex1 is still empty. refresh has not worked correctly.
# Works
class DatF2(pd.DataFrame):
    def __init__(self,meth,data=None):
        cmidx = pd.MultiIndex.from_arrays([['Index', 'Index'],['First', 'Last']])
        rmidx = pd.MultiIndex(levels=[[],[]], codes=[[],[]],
                              names=['Component','Interval'])
        super().__init__(data=data, index=rmidx, columns=cmidx, dtype=np.datetime64)
        self.meth=meth
    def refresh(self):
        values = [[pd.Timestamp('2020/02/10 8:00'),pd.Timestamp('2020/02/10 8:00')],
                  [pd.Timestamp('2020/02/11 8:00'),pd.Timestamp('2020/02/12 8:00')]]
        rmidx = pd.MultiIndex.from_arrays([['Comp1','Comp1'],['1h','1W']],names=['Component','Interval'])
        super().__init__(values, index=rmidx, columns=self.columns)
ex2 = DatF2(summa)
In [10]: ex2.meth(3,4)
Out[10]: 7
ex2.refresh()
In [11]: ex2
Out[11]:
                                  Index
                                  First                Last
Component Interval
Comp1     1h        2020-02-10 08:00:00 2020-02-10 08:00:00
          1W        2020-02-11 08:00:00 2020-02-12 08:00:00
This code works!
I have 2 questions:
Why is the first version not working? (I am sorry, this may be obvious, but I have no idea why it does not work.)
Is calling super().__init__ in my refresh method acceptable coding practice? (Or, rephrased: is it acceptable to call super().__init__ in places other than the __init__ of my subclass?)
Thanks a lot for your help and advice. The world of class inheritance is for me quite new, and the fact that DataFrame content cannot be directly modified, so to say, seems to me to make it a step more difficult to handle.
Have a good day,
Bests,
Error message when adding a new row
import pandas as pd
import numpy as np
# Dummy function
def new_rows():
    return [['Comp1','Comp1'],['1h','1W']]
# Does not work
class DatF1(pd.DataFrame):
    def __init__(self,meth,data=None):
        cmidx = pd.MultiIndex.from_arrays([['Index', 'Index'],['First', 'Last']])
        rmidx = pd.MultiIndex(levels=[[],[]], codes=[[],[]],
                              names=['Component','Interval'])
        super().__init__(data=data, index=rmidx, columns=cmidx, dtype=np.datetime64)
        self.meth=meth
    def refresh(self):
        values = [[pd.Timestamp('2020/02/10 8:00'),pd.Timestamp('2020/02/10 8:00')],
                  [pd.Timestamp('2020/02/11 8:00'),pd.Timestamp('2020/02/12 8:00')]]
        rmidx = self.meth()
        self[rmidx] = values
ex1 = DatF1(new_rows)
ex1.refresh()
KeyError: "None of [MultiIndex([('Comp1', 'Comp1'),\n ( '1h', '1W')],\n names=['Component', 'Interval'])] are in the [index]"
Answers to your questions
why the 1st code is not working?
Assigning to self inside a method only rebinds the local name self to a new DataFrame. The object that ex1 refers to is never touched, so it stays empty. No error is raised because rebinding a local variable is perfectly legal Python.
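What self = ... does inside a method can be reproduced without pandas. In this sketch Box is a hypothetical class: one method rebinds the local name self, the other mutates the existing object in place.

```python
class Box:
    def __init__(self, items):
        self.items = list(items)

    def refresh_rebind(self):
        # Rebinds the local name `self` only; the caller's object is untouched.
        self = Box(["new"])

    def refresh_in_place(self):
        # Mutates the existing object, so the caller sees the change.
        self.items[:] = ["new"]


box = Box([])
box.refresh_rebind()
print(box.items)  # []
box.refresh_in_place()
print(box.items)  # ['new']
```

The first refresh in the question follows the refresh_rebind pattern, which would explain why ex1 still prints as empty afterwards.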
is calling super().__init__ in my refresh method acceptable coding practise?
Maybe a legitimate use case exists for calling super().__init__ outside the __init__() method, but yours is not one of them. You have already inherited everything from pd.DataFrame in your __init__(). Why use it again?
A better solution
The solution to your problem is unexpectedly simple, because you can append a row to a DataFrame with .loc:
df.loc['new_row'] = [value_1, value_2, ...]
Or in your case with a MultiIndex (see this SO post):
df.loc[('Comp1', '1h'), :] = [pd.Timestamp('2020/02/10 8:00'), pd.Timestamp('2020/02/10 8:00')]
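To make the row versus column distinction concrete, here is a small sketch with an invented frame and invented labels: plain bracket assignment creates a column, whereas .loc with a label that does not exist yet enlarges the frame by a row. A flat index is used here; with a MultiIndex the label would be a tuple.

```python
import pandas as pd

df = pd.DataFrame({"First": [1, 2], "Last": [3, 4]}, index=["a", "b"])

df["c"] = [9, 9]           # bracket assignment adds a COLUMN named "c"
print(list(df.columns))    # ['First', 'Last', 'c']
df = df.drop(columns="c")

df.loc["c"] = [5, 6]       # .loc with a new label adds a ROW named "c"
print(df.index.tolist())   # ['a', 'b', 'c']
```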
Best practice
You should not inherit from pd.DataFrame. If you want to extend pandas use the documented API.
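One documented extension point is the accessor registry, pd.api.extensions.register_dataframe_accessor. The sketch below is only an illustration: the accessor name "summary" and its span method are hypothetical and would be replaced by whatever the Summary class needs.

```python
import pandas as pd


@pd.api.extensions.register_dataframe_accessor("summary")
class SummaryAccessor:
    """Hypothetical accessor; the name "summary" and its method are illustrative."""

    def __init__(self, pandas_obj):
        self._obj = pandas_obj

    def span(self):
        # Difference between the largest and smallest value in each column.
        return self._obj.max() - self._obj.min()


df = pd.DataFrame({"a": [1, 5], "b": [2, 10]})
print(df.summary.span().tolist())  # [4, 8]
```

The custom methods live under df.summary, so every ordinary DataFrame gains them without any subclass.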

Proper way to extend Python class

I'm looking to extend a pandas DataFrame, creating an object where all of the original DataFrame attributes/methods are intact, while making a few new attributes/methods available. I also need the ability to convert (or copy) objects that are already DataFrames to my new class. What I have seems to work, but I feel like I might have violated some fundamental convention. Is this the proper way of doing this, or should I even be doing it in the first place?
import pandas as pd
class DataFrame(pd.DataFrame):
    def __init__(self, df):
        df.__class__ = DataFrame # effectively 'cast' Pandas DataFrame as my own
the idea being I could then initialize it directly from a Pandas DataFrame, e.g.:
df = DataFrame(pd.read_csv(path))
I'd probably do it this way, if I had to:
import pandas as pd
class CustomDataFrame(pd.DataFrame):
    @classmethod
    def convert_dataframe(cls, df):
        df.__class__ = cls
        return df
    def foo(self):
        return "Works"
df = pd.DataFrame([1,2,3])
print(df)
#print(df.foo()) # Will throw, since .foo() is not defined on pd.DataFrame
cdf = CustomDataFrame.convert_dataframe(df)
print(cdf)
print(cdf.foo()) # "Works"
Note: This will forever change the df object you pass to convert_dataframe:
print(type(df)) # <class '__main__.CustomDataFrame'>
print(type(cdf)) # <class '__main__.CustomDataFrame'>
If you don't want this, you could copy the dataframe inside the classmethod.
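A possible version of that copying variant is sketched below. It reuses the names from the example above; the local name converted is an assumption.

```python
import pandas as pd


class CustomDataFrame(pd.DataFrame):
    @classmethod
    def convert_dataframe(cls, df):
        # Work on a copy so the caller's object keeps its original type.
        converted = df.copy()
        converted.__class__ = cls
        return converted

    def foo(self):
        return "Works"


df = pd.DataFrame([1, 2, 3])
cdf = CustomDataFrame.convert_dataframe(df)
print(type(df).__name__)   # DataFrame
print(type(cdf).__name__)  # CustomDataFrame
print(cdf.foo())           # Works
```

The cost is one extra copy of the data per conversion.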
If you just want to add methods to a DataFrame just monkey patch before you run anything else as below.
>>> import pandas
>>> def foo(self, x):
...     return x
...
>>> foo
<function foo at 0x00000000009FCC80>
>>> pandas.DataFrame.foo = foo
>>> bar = pandas.DataFrame()
>>> bar
Empty DataFrame
Columns: []
Index: []
>>> bar.foo(5)
5
>>>
if __name__ == '__main__':
app = DataFrame()
app()
event
super(DataFrame,self).__init__()

Changing self.variables inside __exit__ method of Context Managers

First things first: the title is very unclear, but nothing better sprang to mind. I'll elaborate on the problem in more detail.
I've found myself doing this routine a lot with pandas dataframes. I need to work for a while with only part (some columns) of the DataFrame, and later I want to add the other columns back. Then an idea came to my mind: context managers. But I am unable to come up with the correct implementation (if there is any).
import pandas as pd
import numpy as np
class ProtectColumns:
    def __init__(self, df, protect_cols=[]):
        self.protect_cols = protect_cols
        # preserve a copy of the part we want to protect
        self.protected_df = df[protect_cols].copy(deep=True)
        # create self.df with only the part we want to work on
        self.df = df[[x for x in df.columns if x not in protect_cols]]
    def __enter__(self):
        # return self, or maybe only self.df?
        return self
    def __exit__(self, *args, **kwargs):
        # btw. do i need *args and **kwargs here?
        # append the preserved data back to the original, now changed
        self.df[self.protect_cols] = self.protected_df
if __name__ == '__main__':
    # testing
    # create random DataFrame
    df = pd.DataFrame(np.random.randn(6,4), columns=list("ABCD"))
    # unnecessary step
    df = df.applymap(lambda x: int(100 * x))
    # show it
    print(df)
    # work without cols A and B
    with ProtectColumns(df, ["A", "B"]) as PC:
        # make everything 0
        PC.df = PC.df.applymap(lambda x: 0)
    # this prints the expected output
    print(PC.df)
However, say I don't want to use PC.df from then on, but df. I could just do df = PC.df, or make a copy inside the with block or after it. But is it possible to handle this inside e.g. the __exit__ method?
# unchanged df
print(df)
with ProtectColumns(df, list("AB")) as PC:
    PC.applymap(somefunction)
# df is now changed
print(df)
Thanks for any ideas!
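One way this could be approached is sketched below, under the assumption that it is acceptable to write the changed columns back into the caller's DataFrame column by column. The attribute name original and the sample data are made up. On the *args question: __exit__ is always called with three positional arguments (exception type, value, traceback), so naming them explicitly works just as well.

```python
import pandas as pd


class ProtectColumns:
    """Sketch: edit a column subset, then write the result back on exit."""

    def __init__(self, df, protect_cols=()):
        self.original = df
        self.protect_cols = list(protect_cols)
        # Working copy without the protected columns.
        self.df = df.drop(columns=self.protect_cols)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        if exc_type is None:
            # Assign column by column so the caller's DataFrame object is updated.
            for col in self.df.columns:
                self.original[col] = self.df[col]
        return False  # do not swallow exceptions


df = pd.DataFrame({"A": [1, 2], "B": [3, 4], "C": [5, 6]})
with ProtectColumns(df, ["A"]) as pc:
    pc.df = pc.df * 0
print(df["A"].tolist())  # [1, 2]
print(df["B"].tolist())  # [0, 0]
```

Rebinding the name df from inside __exit__ is not possible, which is why the sketch mutates the original object instead of replacing it.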

Propagate pandas series metadata through joins

I'd like to be able to attach metadata to the series of dataframes (specifically, the original filename), so that after joining two dataframes I can see metadata on where each of the series came from.
I see github issues regarding _metadata (here, here), including some relating to the current _metadata attribute (here), but nothing in the pandas docs.
So far I can modify the _metadata attribute to supposedly allow preservation of metadata, but get an AttributeError after the join.
df1 = pd.DataFrame(np.random.randint(0, 4, (6, 3)))
df2 = pd.DataFrame(np.random.randint(0, 4, (6, 3)))
df1._metadata.append('filename')
df1[df1.columns[0]]._metadata.append('filename')
for c in df1:
    df1[c].filename = 'fname1.csv'
    df2[c].filename = 'fname2.csv'
df1[0]._metadata # ['name', 'filename']
df1[0].filename # fname1.csv
df2[0].filename # fname2.csv
df1[0][:3].filename # fname1.csv
mgd = pd.merge(df1, df2, on=[0])
mgd['1_x']._metadata # ['name', 'filename']
mgd['1_x'].filename # raises AttributeError
Any way to preserve this?
Update: Epilogue
As discussed here, __finalize__ cannot keep track of Series that are members of a dataframe, only independent series. So for now I'll keep track of the Series-level metadata by maintaining a dictionary of metadata attached to the dataframes. My code looks like:
def cust_merge(d1, d2):
    "Custom merge function for 2 dicts"
    ...
def finalize_df(self, other, method=None, **kwargs):
    for name in self._metadata:
        if method == 'merge':
            lmeta = getattr(other.left, name, {})
            rmeta = getattr(other.right, name, {})
            newmeta = cust_merge(lmeta, rmeta)
            object.__setattr__(self, name, newmeta)
        else:
            object.__setattr__(self, name, getattr(other, name, None))
    return self
df1.filenames = {c: 'fname1.csv' for c in df1}
df2.filenames = {c: 'fname2.csv' for c in df2}
pd.DataFrame._metadata = ['filenames']
pd.DataFrame.__finalize__ = finalize_df
I think something like this will work (and if not, please file a bug report, as this, while supported, is a bit bleeding edge; in other words, it IS possible that the join methods don't call this all the time. That is a bit untested).
See this issue for a more detailed example/bug fix.
DataFrame._metadata = ['name','filename']
def __finalize__(self, other, method=None, **kwargs):
    """
    propagate metadata from other to self
    Parameters
    ----------
    other : the object from which to get the attributes that we are going
        to propagate
    method : optional, a passed method name ; possibly to take different
        types of propagation actions based on this
    """
    ### you need to arbitrate when there are conflicts
    for name in self._metadata:
        object.__setattr__(self, name, getattr(other, name, None))
    return self
DataFrame.__finalize__ = __finalize__
So this replaces the default finalizer for DataFrame with your custom one. Where I have indicated, you need to put some code that arbitrates between conflicts. This is the reason it is not done by default: e.g. if frame1 has name 'foo' and frame2 has name 'bar', what do you do when the method is __add__, and what about another method? Let us know what you do and how it works out.
This ONLY replaces the finalizer for DataFrame (and you can simply keep the default action if you want, which is to propagate other to self); you can also set nothing except under special cases of method.
This method is meant to be overridden in sub-classes; that's why you are monkey patching here (rather than sub-classing, which is overkill most of the time).
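For comparison, the sub-classing route mentioned above might look like this minimal sketch. TaggedFrame and its filename attribute are hypothetical names. Which operations actually propagate the attribute depends on where pandas calls __finalize__, so only copy() is shown and merges may still drop it.

```python
import pandas as pd


class TaggedFrame(pd.DataFrame):
    """Hypothetical subclass carrying a `filename` attribute."""

    _metadata = ["filename"]  # names pandas should try to propagate

    @property
    def _constructor(self):
        return TaggedFrame


frame = TaggedFrame({"A": [1, 2], "B": [3, 4]})
frame.filename = "fname1.csv"

copied = frame.copy()
print(type(copied).__name__)              # TaggedFrame
print(getattr(copied, "filename", None))  # fname1.csv
```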
