Propagate pandas series metadata through joins

I'd like to be able to attach metadata to the series of my dataframes (specifically, the original filename), so that after joining two dataframes I can see where each of the series came from.
I see github issues regarding _metadata (here, here), including some relating to the current _metadata attribute (here), but nothing in the pandas docs.
So far I can modify the _metadata attribute to supposedly allow preservation of metadata, but get an AttributeError after the join.
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.randint(0, 4, (6, 3)))
df2 = pd.DataFrame(np.random.randint(0, 4, (6, 3)))
df1._metadata.append('filename')
df1[df1.columns[0]]._metadata.append('filename')
for c in df1:
    df1[c].filename = 'fname1.csv'
    df2[c].filename = 'fname2.csv'
df1[0]._metadata     # ['name', 'filename']
df1[0].filename      # fname1.csv
df2[0].filename      # fname2.csv
df1[0][:3].filename  # fname1.csv
mgd = pd.merge(df1, df2, on=[0])
mgd['1_x']._metadata  # ['name', 'filename']
mgd['1_x'].filename   # raises AttributeError
Any way to preserve this?
Update: Epilogue
As discussed here, __finalize__ cannot keep track of Series that are members of a dataframe, only independent series. So for now I'll keep track of the Series-level metadata by maintaining a dictionary of metadata attached to the dataframes. My code looks like:
def cust_merge(d1, d2):
    """Custom merge function for 2 dicts"""
    ...

def finalize_df(self, other, method=None, **kwargs):
    for name in self._metadata:
        if method == 'merge':
            lmeta = getattr(other.left, name, {})
            rmeta = getattr(other.right, name, {})
            newmeta = cust_merge(lmeta, rmeta)
            object.__setattr__(self, name, newmeta)
        else:
            object.__setattr__(self, name, getattr(other, name, None))
    return self

df1.filenames = {c: 'fname1.csv' for c in df1}
df2.filenames = {c: 'fname2.csv' for c in df2}
pd.DataFrame._metadata = ['filenames']
pd.DataFrame.__finalize__ = finalize_df
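The body of cust_merge is left elided above. For illustration only, a minimal merge that keeps both sides and flags conflicting keys might look like this (a sketch, not the original implementation):

def cust_merge(d1, d2):
    """Merge two column->filename dicts, keeping both values on conflict."""
    merged = dict(d1)
    for key, value in d2.items():
        if key in merged and merged[key] != value:
            merged[key] = (merged[key], value)  # record both origins
        else:
            merged[key] = value
    return merged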

I think something like this will work (and if not, please file a bug report; this, while supported, is a bit bleeding edge, in other words it IS possible that the join methods don't call this all the time. That is a bit untested).
See this issue for a more detailed example/bug fix.
DataFrame._metadata = ['name', 'filename']

def __finalize__(self, other, method=None, **kwargs):
    """
    Propagate metadata from other to self.

    Parameters
    ----------
    other : the object from which to get the attributes that we are going
        to propagate
    method : optional, a passed method name; possibly to take different
        types of propagation actions based on this
    """
    ### you need to arbitrate when there are conflicts
    for name in self._metadata:
        object.__setattr__(self, name, getattr(other, name, None))
    return self

DataFrame.__finalize__ = __finalize__
So this replaces the default finalizer for DataFrame with your custom one. Where I have indicated, you need to put some code which can arbitrate between conflicts. This is the reason it is not done by default: e.g. if frame1 has name 'foo' and frame2 has name 'bar', what do you do when the method is __add__, and what about another method? Let us know what you do and how it works out.
This is ONLY replacing for DataFrame (and you can simply do the default action if you want), which is to propagate other to self; you can also not set anything except under special cases of method.
This method is meant to be overridden in sub-classes, which is why you are monkey patching here (rather than sub-classing, which is most of the time overkill).
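A quick way to sanity-check the patch (a sketch; as noted above, not every pandas code path is guaranteed to call __finalize__):

df = pd.DataFrame({'a': [1, 2, 3]})
df.filename = 'fname1.csv'    # allowed because 'filename' is in _metadata

sliced = df[df['a'] > 1]      # boolean indexing routes through __finalize__
print(sliced.filename)        # fname1.csv, if propagation fired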

Related

Create class that appends all entries to a DataFrame

I am having trouble with the following task.
I need to create a class that accommodates the student id, name, and grades of all students. My idea was to create an empty DataFrame and append the values I add to the class.
I came up with the below.
class Student:
    students_archive = pd.DataFrame(index='student_id',
                                    columns=['student_id', 'name', 'grade'])

    def __init__(self, s_id, name, grade):
        self.s_id = s_id
        self.name = name
        self.grade = grade
        st = {'student_id': self.s_id, 'name': self.name, 'grade': self.grade}
        pd.concat([Student.students_archive, st])
I am however getting the following error:
If using all scalar values, you must pass an index
I don't really understand what's wrong and I have looked all around; can anybody help me? Thanks.
I also can't help but think mine is the wrong approach, since the task doesn't actually specify that it needs to be a dataframe; it just says that I have to 'create a class that accommodates students' name, grade, and id, and create methods to add, remove or upgrade the students' values'. Perhaps I can do all of that without creating a dataframe?
Thank you
I can't comment yet so here's an answer.
Running the code shows me 2 errors:
This index = 'student_id' should be more like this: index = ['student_id']
This pd.concat([Student.students_archive, st]) does not accept a dictionary, but rather something like pd.concat([dataframe_0, dataframe_1])
Also I think it's better if you add your dataframe inside the __init__().
So it should result in something like:
class Student:
    def __init__(self, s_id, name, grade):
        self.s_id = s_id
        self.name = name
        self.grade = grade
        self.students_archive = pd.DataFrame(index=['student_id'],
                                             columns=['student_id', 'name', 'grade'])
        temp_df = pd.DataFrame.from_dict({'student_id': [self.s_id],
                                          'name': [self.name],
                                          'grade': [self.grade]})
        new_temp_df = pd.concat([self.students_archive, temp_df])

But keep in mind that temp_df and new_temp_df will probably be garbage collected, so keep what you want by prefixing it with self.
As I understand it:
template = pd.DataFrame(index = ['student_id'], columns = ['student_id', 'name', 'grade'])
entries = pd.DataFrame.from_dict({'student_id': [self.s_id],'name': [self.name], 'grade': [self.grade]})
self.student_df = pd.concat([template, entries])
UPDATE:
Having read your comment and the code you linked, it seems like instantiating each student is out of scope. You seem to be on the right track though.
I would still add the main dataframe inside __init__(); it keeps things more tidy imho, and you can still access it from outside the class (more below). I would also incorporate the functionality needed through methods (add/remove/update), which you started doing.
Whether this approach is correct or not is probably up to your professor and whether they forbid libraries like pandas. I don't see anything wrong with it, as it gives you much of what's needed.
So, in code I would suggest something like this:
class Students:
    def __init__(self):
        # keep in mind that pandas will add a default integer index (0, 1, 2, ...) on the left
        self.df = pd.DataFrame(columns=['student_id', 'name', 'grade'])

    def add_student(self, s_id, nm, gr):
        # a new dataframe containing the new values is returned; we replace our
        # old one with it (DataFrame.append was removed in pandas 2.0, so
        # pd.concat is used here instead)
        row = pd.DataFrame([{'student_id': s_id, 'name': nm, 'grade': gr}])
        self.df = pd.concat([self.df, row], ignore_index=True)

    def remove_student(self, s_id):
        # we are basically getting a new dataframe without the rows that have
        # s_id as their student_id
        self.df = self.df[self.df.student_id != s_id]

    # this could be broken down into 3 methods, 1 for each (id, name, grade)
    def update_student(self, s_id, nm, gr):
        # simplest approach: remove the old row, then add the updated one
        self.remove_student(s_id)
        self.add_student(s_id, nm, gr)
Accessing the dataframe from outside is as easy as:
S = Students()
print(S.df)
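For example (assuming the class above):

roster = Students()
roster.add_student(1, 'Ada', 95)
roster.add_student(2, 'Alan', 88)
roster.remove_student(1)
print(roster.df)  # one remaining row: student_id 2, Alan, 88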

How to name dataframe columns based on class arguments?

I've been spending a huge amount of time on this. I have a tuple of non-callable class instances, all named SymbolInfo, with the same attribute labels. Let's say:
In: print(my_tuple)
Out: (SymbolInfo(att_1=False, att_2=0, att_3=1.0),SymbolInfo(att_1=True, att_2=0, att_3=1.5))
My objective is to create a dataframe from this tuple. When I convert it to a list, it works fine:
df = pd.DataFrame(list(my_tuple))
I get the dataframe, but I don't get the column labels, which should be the names of the class attributes (i.e. att_1, att_2, att_3).
The attribute names and their quantity (not their values) are standardized across all the classes, so I could use any of them to get it.
I've tried methods like inspect.getmembers(my_tuple[0]) and inspect.getfullargspec(my_tuple[0]).args without success. It's important to get those arguments in the same sequence they appear in.
Got this solution:
my_dict = my_tuple[0]._asdict()
my_col_list = list(my_dict.keys())
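Put together, and assuming SymbolInfo is a namedtuple (which is what _asdict() implies), that looks like:

from collections import namedtuple

import pandas as pd

SymbolInfo = namedtuple('SymbolInfo', ['att_1', 'att_2', 'att_3'])
my_tuple = (SymbolInfo(False, 0, 1.0), SymbolInfo(True, 0, 1.5))

my_col_list = list(my_tuple[0]._asdict().keys())
df = pd.DataFrame(list(my_tuple), columns=my_col_list)
print(df.columns.tolist())  # ['att_1', 'att_2', 'att_3']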
You can access the attributes (in the order they were added) via the __dict__ attribute, like so:
class thing():
    def __init__(self, att_1, att_2, att_3):
        self.z_att_1 = att_1
        self.att_2 = att_2
        self.att_3 = att_3

a = thing('bob', 'eve', 'alice')
b = thing('john', 'jack', 'dan')
c = (a, b)

# see the attributes
print(c[0].__dict__)
Results in:
# note that z_att_1 is first
{'z_att_1': 'bob', 'att_2': 'eve', 'att_3': 'alice'}
Now you can loop through the dictionary and pull out the keys for the attribute names.
Just create a dict from each class instance and then create the dataframe:
pd.DataFrame(list(map(lambda x: x.__dict__, my_tuple)))
And I recommend using the attrs library for OOP in Python.
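For reference, a minimal attrs version (a sketch; the attribute names are taken from the question):

import attr
import pandas as pd

@attr.s(auto_attribs=True)
class SymbolInfo:
    att_1: bool
    att_2: int
    att_3: float

my_tuple = (SymbolInfo(False, 0, 1.0), SymbolInfo(True, 0, 1.5))
df = pd.DataFrame([attr.asdict(s) for s in my_tuple])
print(df.columns.tolist())  # ['att_1', 'att_2', 'att_3']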

python/pandas: DataFrame inheritance and DataFrame update when 'inplace' is not possible

I am sorry, I am aware the title is somewhat fuzzy.
Context
I am using a DataFrame to keep track of files, because pandas DataFrame features several relevant functions for all kinds of filtering a dict cannot do, with loc, pd.IndexSlice, .index, .columns, pd.MultiIndex...
Ok, so this may not appear to be the best choice to expert developers (which I am not), but all these functions have been so handy that I have come to use a DataFrame for this.
And as the cherry on the cake, the __repr__ of a MultiIndex DataFrame is just perfect when I want to know what is inside my file list.
Quick introduction to Summary class, inheriting from DataFrame
Because my DataFrame, which I call 'Summary', has some specific functions, I would like to make it a class inheriting from the pandas DataFrame class.
It also has 'fixed' MultiIndexes, for both rows and columns.
Finally, because my Summary class is defined outside the Store class which actually manages the file organization, the Summary class needs a function from Store to be able to retrieve the file organization.
Questions
The trouble with pd.DataFrame is that (AFAIK) you cannot append rows without creating a new DataFrame.
As Summary has a refresh function so that it can recreate itself by reading the folder content, a refresh somehow 'resets' the Summary object.
To manage the Summary refresh, I came up with a first version (not working) and finally a second one (working).
import pandas as pd
import numpy as np

# Dummy function
def summa(a, b):
    return a + b

# Does not work
class DatF1(pd.DataFrame):
    def __init__(self, meth, data=None):
        cmidx = pd.MultiIndex.from_arrays([['Index', 'Index'], ['First', 'Last']])
        rmidx = pd.MultiIndex(levels=[[], []], codes=[[], []],
                              names=['Component', 'Interval'])
        super().__init__(data=data, index=rmidx, columns=cmidx, dtype=np.datetime64)
        self.meth = meth

    def refresh(self):
        values = [[pd.Timestamp('2020/02/10 8:00'), pd.Timestamp('2020/02/10 8:00')],
                  [pd.Timestamp('2020/02/11 8:00'), pd.Timestamp('2020/02/12 8:00')]]
        rmidx = pd.MultiIndex.from_arrays([['Comp1', 'Comp1'], ['1h', '1W']],
                                          names=['Component', 'Interval'])
        self = pd.DataFrame(values, index=rmidx, columns=self.columns)
ex1 = DatF1(summa)

In [10]: ex1.meth(3, 4)
Out[10]: 7

ex1.refresh()

In [11]: ex1
Out[11]: Empty DatF1
Columns: [(Index, First), (Index, Last)]
Index: []
After refresh(), ex1 is still empty. refresh has not worked correctly.
# Works
class DatF2(pd.DataFrame):
    def __init__(self, meth, data=None):
        cmidx = pd.MultiIndex.from_arrays([['Index', 'Index'], ['First', 'Last']])
        rmidx = pd.MultiIndex(levels=[[], []], codes=[[], []],
                              names=['Component', 'Interval'])
        super().__init__(data=data, index=rmidx, columns=cmidx, dtype=np.datetime64)
        self.meth = meth

    def refresh(self):
        values = [[pd.Timestamp('2020/02/10 8:00'), pd.Timestamp('2020/02/10 8:00')],
                  [pd.Timestamp('2020/02/11 8:00'), pd.Timestamp('2020/02/12 8:00')]]
        rmidx = pd.MultiIndex.from_arrays([['Comp1', 'Comp1'], ['1h', '1W']],
                                          names=['Component', 'Interval'])
        super().__init__(values, index=rmidx, columns=self.columns)
ex2 = DatF2(summa)

In [10]: ex2.meth(3, 4)
Out[10]: 7

ex2.refresh()

In [11]: ex2
Out[11]:
                                 Index
                                 First                Last
Component Interval
Comp1     1h       2020-02-10 08:00:00 2020-02-10 08:00:00
          1W       2020-02-11 08:00:00 2020-02-12 08:00:00
This code works!
I have 2 questions:
why is the 1st code not working? (I am sorry, this may be obvious, but I am completely ignorant of why it does not work)
is calling super().__init__ in my refresh method acceptable coding practice? (or, rephrased: is it acceptable to call super().__init__ in places other than the __init__ of my subclass?)
Thanks a lot for your help and advice. The world of class inheritance is quite new to me, and the fact that DataFrame content cannot be directly modified, so to say, seems to make it a step more difficult to handle.
Have a good day,
Bests,
Error message when adding a new row
import pandas as pd
import numpy as np

# Dummy function
def new_rows():
    return [['Comp1', 'Comp1'], ['1h', '1W']]

# Does not work
class DatF1(pd.DataFrame):
    def __init__(self, meth, data=None):
        cmidx = pd.MultiIndex.from_arrays([['Index', 'Index'], ['First', 'Last']])
        rmidx = pd.MultiIndex(levels=[[], []], codes=[[], []],
                              names=['Component', 'Interval'])
        super().__init__(data=data, index=rmidx, columns=cmidx, dtype=np.datetime64)
        self.meth = meth

    def refresh(self):
        values = [[pd.Timestamp('2020/02/10 8:00'), pd.Timestamp('2020/02/10 8:00')],
                  [pd.Timestamp('2020/02/11 8:00'), pd.Timestamp('2020/02/12 8:00')]]
        rmidx = self.meth()
        self[rmidx] = values

ex1 = DatF1(new_rows)
ex1.refresh()

KeyError: "None of [MultiIndex([('Comp1', 'Comp1'),\n             ( '1h', '1W')],\n           names=['Component', 'Interval'])] are in the [index]"
Answers to your questions
why is the 1st code not working?
Inside refresh, the line self = pd.DataFrame(...) only rebinds the local name self to a brand-new DataFrame; the instance that ex1 refers to is never modified. That is why you get no error, just an ex1 that is still empty.
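A minimal illustration of that name-binding behaviour, with no pandas involved:

class Demo:
    def refresh(self):
        # rebinds only the local name; the instance is untouched
        self = "something else"

d = Demo()
d.refresh()
print(type(d))  # <class '__main__.Demo'> - d is unchanged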
is calling super().__init__ in my refresh method acceptable coding practice?
Maybe a legitimate use case exists for calling super().__init__ outside the __init__() method, but your case is not one of them. You have already initialized everything in your __init__(); why do it again?
A better solution
The solution to your problem is unexpectedly simple, because you can append a row to a DataFrame in place (note .loc, which targets a row; plain df['new_row'] = ... would create a column):
df.loc['new_row'] = [value_1, value_2, ...]
Or, in your case with a MultiIndex (see this SO post):
df.loc[('Comp1', '1h'), :] = [pd.Timestamp('2020/02/10 8:00'), pd.Timestamp('2020/02/10 8:00')]
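A sketch of that on the question's frame (this relies on .loc setting-with-enlargement, which needs the complete row key):

import pandas as pd

cmidx = pd.MultiIndex.from_arrays([['Index', 'Index'], ['First', 'Last']])
rmidx = pd.MultiIndex.from_arrays([['Comp1'], ['1h']], names=['Component', 'Interval'])
df = pd.DataFrame([[pd.Timestamp('2020/02/10 8:00'), pd.Timestamp('2020/02/10 8:00')]],
                  index=rmidx, columns=cmidx)

# append a second row in place; no new DataFrame is created
df.loc[('Comp1', '1W'), :] = [pd.Timestamp('2020/02/11 8:00'),
                              pd.Timestamp('2020/02/12 8:00')]
print(df)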
Best practice
You should not inherit from pd.DataFrame. If you want to extend pandas, use the documented API.

How to Share a Variable Across Methods in a Class in Python

class Dataframe:  # Recommended to instantiate your dataframe with your csv name.
    """
    otg_merge = Dataframe("/Users/zachary/Desktop/otg_merge.csv")  # instantiate as a pandas dataframe
    """
    def __init__(self, filepath, filename=None):
        pd = __import__('pandas')  # import pandas when the class is instantiated
        self.filepath = filepath
        self.filename = filename

    def df(self):  # it makes the DataFrame
        df = pd.read_csv(self.filepath, encoding="cp949", index_col=0)  # index col is not included
        return df

    def shape(self):  # it returns the dimensions of the DataFrame
        shape = list(df.shape)
        return shape

    def head(self):  # it returns the head of the DataFrame
        primer = pd.DataFrame.head(df)
        del primer["Unnamed: 0"]
        return primer

    def cust_types(self):  # it returns the list of cust_type values included in the .csv
        cust_type = []
        for i in range(0, shape[0]):
            if df.at[i, "cust_type"] not in cust_type:  # if it's new...
                cust_type.append(df.at[i, "cust_type"])  # append it as a new list element
        return cust_type
I am wrapping some pandas functions for people who don't necessarily need to know pandas.
As you can see in the code, at the third def, shape returns the shape as a list, such as [11000, 134], i.e. the x-dim and y-dim.
Now I'd like to use shape again in the last def, cust_types; however, it says shape is not defined.
How can I share the variable shape across defs in the same class?
Interestingly, I didn't do anything special, yet df seems to be shared from the second def (df) to the third (shape) without error.
First, prepend "self." to all your attributes, which you will come to understand after trying out some Python OOP tutorials. Another issue which you might miss is:
def df(self):
    df = pd.read_csv(self.filepath, encoding="cp949", index_col=0)
    return df
Here, the method and the variable take the same name, which is fine as long as the variable is not an instance attribute (as it is not here). But if you prepend "self." and make it an instance attribute, the instance attribute will be self.df, and it can no longer be a method after the first call of self.df().
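A minimal rework along those lines (a sketch: load once in __init__ and store everything on self, so every method sees the same data; the method set shown here is an assumption, not the original API):

import pandas as pd

class Dataframe:
    """Thin wrapper for users who don't need to know pandas."""
    def __init__(self, filepath, filename=None):
        self.filepath = filepath
        self.filename = filename
        self.df = pd.read_csv(filepath, encoding="cp949", index_col=0)
        self.shape = list(self.df.shape)  # instance attribute, visible in every method

    def head(self):
        return self.df.head()

    def cust_types(self):
        # unique cust_type values, in order of first appearance
        seen = []
        for value in self.df["cust_type"]:
            if value not in seen:
                seen.append(value)
        return seen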

Why does subclassing a DataFrame mutate the original object?

I am ignoring the warnings and trying to subclass a pandas DataFrame. My reasons for doing so are as follows:
I want to retain all the existing methods of DataFrame.
I want to set a few additional attributes at class instantiation, which will later be used to define additional methods that I can call on the subclass.
Here's a snippet:
class SubFrame(pd.DataFrame):
    def __init__(self, *args, **kwargs):
        freq = kwargs.pop('freq', None)
        ddof = kwargs.pop('ddof', None)
        super(SubFrame, self).__init__(*args, **kwargs)
        self.freq = freq
        self.ddof = ddof
        self.index.freq = pd.tseries.frequencies.to_offset(self.freq)

    @property
    def _constructor(self):
        return SubFrame
Here's a use example. Say I have the DataFrame
print(df)
col0 col1 col2
2014-07-31 0.28393 1.84587 -1.37899
2014-08-31 5.71914 2.19755 3.97959
2014-09-30 -3.16015 -7.47063 -1.40869
2014-10-31 5.08850 1.14998 2.43273
2014-11-30 1.89474 -1.08953 2.67830
where the index has no frequency
print(df.index)
DatetimeIndex(['2014-07-31', '2014-08-31', '2014-09-30', '2014-10-31',
'2014-11-30'],
dtype='datetime64[ns]', freq=None)
Using SubFrame allows me to specify that frequency in one step:
sf = SubFrame(df, freq='M')
print(sf.index)
DatetimeIndex(['2014-07-31', '2014-08-31', '2014-09-30', '2014-10-31',
'2014-11-30'],
dtype='datetime64[ns]', freq='M')
The issue is, this modifies df:
print(df.index.freq)
<MonthEnd>
What's going on here, and how can I avoid this?
Moreover, I profess to using copied code that I don't understand all that well. What is happening within __init__ above? Is it necessary to use args/kwargs with pop here? (Why can't I just specify params as usual?)
I'll add to the warnings. Not that I want to discourage you; I actually applaud your efforts.
However, this won't be the last of your questions as to what is going on.
That said, once you run:
super(SubFrame, self).__init__(*args, **kwargs)
self is a bona fide dataframe. You created it by passing another dataframe to the constructor.
Try this as an experiment:

d1 = pd.DataFrame(1, list('AB'), list('XY'))
d2 = pd.DataFrame(d1)
d2.index.name = 'IDX'

d1

     X  Y
IDX
A    1  1
B    1  1
So the observed behavior is consistent, in that when you construct one dataframe by passing another dataframe to the constructor, you end up pointing to the same objects.
To answer your question, subclassing isn't what allows the mutation of the original object; it's the way pandas constructs a dataframe from a passed dataframe.
Avoid this by instantiating with a copy:
d2 = pd.DataFrame(d1.copy())
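Applied to the original example:

sf = SubFrame(df.copy(), freq='M')
print(sf.index.freq)  # <MonthEnd>
print(df.index.freq)  # None - the original is untouched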
What's going on in the __init__
You want to pass on all the args and kwargs to pd.DataFrame.__init__, with the exception of the specific kwargs that are intended for your subclass: in this case, freq and ddof. pop is a convenient way to grab each value and delete its key from kwargs before passing kwargs on to pd.DataFrame.__init__.
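To see why the pop matters, here is the failure mode it avoids (a sketch):

import pandas as pd

# without the pop, 'freq' would be forwarded to pd.DataFrame.__init__,
# which does not accept it:
try:
    pd.DataFrame({'A': [1]}, freq='M')
except TypeError as err:
    print(err)  # __init__() got an unexpected keyword argument 'freq'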
How I'd implement this with pipe
def add_freq(df, freq):
    df = df.copy()
    df.index.freq = pd.tseries.frequencies.to_offset(freq)
    return df

df = pd.DataFrame(dict(A=[1, 2]), pd.to_datetime(['2017-03-31', '2017-04-30']))
df.pipe(add_freq, 'M')
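Since add_freq works on a copy, keep the result:

df = df.pipe(add_freq, 'M')
print(df.index.freq)  # <MonthEnd>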
