Slicing pandas DataFrames without losing the DataFrame's attributes

I like to store metadata about a dataframe by simply setting an attribute and its corresponding value, like this:
df.foo = "bar"
However, I've found that attributes stored like this are gone once I slice the dataframe:
df.foo = "bar"
df[:100].foo
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "C:\Users\admin\PycharmProjects\project\venv\lib\site-packages\pandas\core\generic.py", line 5465, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'foo'
I wonder if this behavior can be changed, similar to how drop=True or inplace=True change the way methods like df.set_index(args) work. I didn't find anything helpful in the pandas docs.

For many operations, pandas returns a new object, so any attributes you have defined that aren't natively supported by the pd.DataFrame class will not persist.
A simple alternative is to subclass the DataFrame. You need to be sure to add the attribute to _metadata, or else it won't persist.
import pandas as pd
class MyDataFrame(pd.DataFrame):
    # temporary properties
    _internal_names = pd.DataFrame._internal_names
    _internal_names_set = set(_internal_names)
    # normal properties (names listed here are carried over to new objects)
    _metadata = ["foo"]
    @property
    def _constructor(self):
        # make pandas operations return MyDataFrame instead of pd.DataFrame
        return MyDataFrame
df = MyDataFrame({'data': range(10)})
df.foo = 'bar'
df[:100].foo
#'bar'

Related

Extract quarter information from numpy datetime64 object

I have below numpy datetime64 object
import numpy as np
date_time = np.datetime64('2012-05-01T01:00:00.000000+0100')
I would like to express this in YearQuarter i.e. '2012Q2'. Is there any method available to perform this? I tried with pandas Timestamp method but it generates error:
import pandas as pd
>>> pd.Timestamp(date_time).dt.quarter
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'Timestamp' object has no attribute 'dt'
Any pointer will be very helpful

There are various ways that one can achieve that, depending on the desired output type.
If one wants the type pandas._libs.tslibs.period.Period, then one can use either of the following.
pandas.Period, as follows
year_quarter = pd.Period(date_time, freq='Q')
[Out]: 2012Q2
pandas.Timestamp, as user7864386 mentioned, as follows
year_quarter = pd.Timestamp(date_time).to_period('Q')
[Out]: 2012Q2
Alternatively, if one wants the final output to be a string, one can call the Period's strftime method, more specifically .strftime('%YQ%q'), such as
year_quarter = pd.Period(date_time, freq='Q').strftime('%YQ%q')
# or
year_quarter = pd.Timestamp(date_time).to_period('Q').strftime('%YQ%q')
Notes:
date_time = np.datetime64('2012-05-01T01:00:00.000000+0100') gives a DeprecationWarning: parsing timezone aware datetimes is deprecated; this will raise an error in the future
To check the variable year_quarter type, one can do the following
print(type(year_quarter))
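The approaches above can be checked with a minimal sketch like the following. It assumes a timezone-naive timestamp (the +0100 offset is dropped here to avoid the DeprecationWarning), and the variable names label_from_period and label_from_fields are illustrative only.

```python
import numpy as np
import pandas as pd

# Timezone-naive variant of the timestamp from the question (assumption:
# the +0100 offset is not needed to determine the quarter).
date_time = np.datetime64('2012-05-01T01:00:00')

# Period object representing the quarter that contains the timestamp.
year_quarter = pd.Timestamp(date_time).to_period('Q')

# Two ways to obtain a plain string label for that quarter.
label_from_period = year_quarter.strftime('%YQ%q')
ts = pd.Timestamp(date_time)
label_from_fields = f"{ts.year}Q{ts.quarter}"

print(type(year_quarter), label_from_period, label_from_fields)
```

Building the label from ts.year and ts.quarter avoids the Period object entirely, which may be convenient when only the string is needed.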

Is it possible to use a keyword name from **kwargs to filter my data frame?

Apologies if the title is a bit obscure; I am happy to change it.
Problem: I am trying to use a keyword name in the following code to filter by column name in a dataframe using pandas.
@staticmethod
def filter_json(json, col_filter, **kwargs):
    '''
    Convert and filter a JSON object into a dataframe
    '''
    df = pd.read_json(json).drop(col_filter, axis=1)
    for arg in kwargs:
        df = df[(df.arg.isin(kwargs[arg]))]
    return df
However, I get the error AttributeError: 'DataFrame' object has no attribute 'arg' at the line df = df[(df.arg.isin(kwargs[arg]))], because arg is not a valid column name (which makes sense).
I am calling the method with the following...
filter_json(json_obj, MY_COL_FILTERS, IsOpen=['false', 0])
Meaning df.arg should essentially be df.IsOpen
Question: Is there a way to use arg as my column name (IsOpen) here, rather than having to input it manually as df.IsOpen?

You can access columns with dataframe[columnname] notation as well: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html
Try:
for arg in kwargs: # arg is 'IsOpen'
    df = df[(df[arg].isin(kwargs[arg]))] # df['IsOpen'] is same as df.IsOpen
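A self-contained sketch of the bracket-based lookup is shown below. The function name filter_frame, the sample columns, and the sample values are made up for illustration; only the df[column].isin(...) pattern comes from the answer above.

```python
import pandas as pd

def filter_frame(df, **kwargs):
    """Keep rows whose value in each named column is in the allowed list."""
    for column, allowed in kwargs.items():
        # Bracket indexing looks the column up by the string held in `column`,
        # whereas df.column would look for a column literally named "column".
        df = df[df[column].isin(allowed)]
    return df

# Hypothetical sample data standing in for the parsed JSON.
frame = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "IsOpen": ["false", "true", "false", "true"],
    "Region": ["EU", "EU", "US", "US"],
})

result = filter_frame(frame, IsOpen=["false"], Region=["EU", "US"])
print(result)
```

Iterating over kwargs.items() yields the column name and the allowed values together, so the dictionary does not need to be indexed again inside the loop.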

Proper way to extend Python class

I'm looking to extend a pandas DataFrame, creating an object where all of the original DataFrame attributes/methods are intact, while making a few new attributes/methods available. I also need the ability to convert (or copy) objects that are already DataFrames to my new class. What I have seems to work, but I feel like I might have violated some fundamental convention. Is this the proper way of doing this, or should I even be doing it in the first place?
import pandas as pd
class DataFrame(pd.DataFrame):
    def __init__(self, df):
        df.__class__ = DataFrame # effectively 'cast' Pandas DataFrame as my own
the idea being I could then initialize it directly from a Pandas DataFrame, e.g.:
df = DataFrame(pd.read_csv(path))

I'd probably do it this way, if I had to:
import pandas as pd
class CustomDataFrame(pd.DataFrame):
    @classmethod
    def convert_dataframe(cls, df):
        # reassign the class of the existing object in place
        df.__class__ = cls
        return df
    def foo(self):
        return "Works"
df = pd.DataFrame([1,2,3])
print(df)
#print(df.foo()) # Will throw, since .foo() is not defined on pd.DataFrame
cdf = CustomDataFrame.convert_dataframe(df)
print(cdf)
print(cdf.foo()) # "Works"
Note: This will forever change the df object you pass to convert_dataframe:
print(type(df)) # <class '__main__.CustomDataFrame'>
print(type(cdf)) # <class '__main__.CustomDataFrame'>
If you don't want this, you could copy the dataframe inside the classmethod.
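The copying variant mentioned above could look like the following minimal sketch. It keeps the same __class__ reassignment but applies it to a copy, so the object passed in is left untouched; the variable names are illustrative.

```python
import pandas as pd

class CustomDataFrame(pd.DataFrame):
    @classmethod
    def convert_dataframe(cls, df):
        # Work on a copy so the caller's DataFrame keeps its original class.
        converted = df.copy()
        converted.__class__ = cls
        return converted

    def foo(self):
        return "Works"

original = pd.DataFrame([1, 2, 3])
cdf = CustomDataFrame.convert_dataframe(original)

print(type(original))  # still a plain pandas DataFrame
print(type(cdf))       # the custom subclass
print(cdf.foo())
```

Note that this sketch does not define _constructor, so objects derived from cdf (for example by slicing) may come back as plain DataFrames.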

If you just want to add methods to a DataFrame, just monkey patch before you run anything else, as below.
>>> import pandas
>>> def foo(self, x):
...     return x
...
>>> foo
<function foo at 0x00000000009FCC80>
>>> pandas.DataFrame.foo = foo
>>> bar = pandas.DataFrame()
>>> bar
Empty DataFrame
Columns: []
Index: []
>>> bar.foo(5)
5
>>>

if __name__ == '__main__':
app = DataFrame()
app()
event
super(DataFrame,self).__init__()

Populating an object from dataframe

I am currently trying to implement a genetic algorithm. I have built a Python class Gene, and I am trying to load Gene objects from a dataframe df:
class Gene:
    def __init__(self,id,nb_trax,nb_days):
        self.id=id
        self.nb_trax=nb_trax
        self.nb_days=nb_days
I then created a second class, Chromosome, with 20 Gene objects as its property:
class Chromosome(object):
    def __init__(self):
        self.port = [Gene() for id in range(20)]
This is the dataframe
ID     nb_obj  nb_days
ECGYE  10259   62.965318
NLRTM  8007    46.550562
I successfully loaded the Gene objects using
tester=df.apply(lambda row: Gene(row['Injection Port'],row['Avg Daily Injection'],random.randint(1,10)), axis=1)
But I cannot load the Chromosome class using
f=Chromosome(tester)
I get this error
Traceback (most recent call last):
  File "chrom.py", line 27, in <module>
    f=Chromosome(tester)
TypeError: __init__() takes 1 positional argument but 2 were given
Any help please?

The error can be misleading: it says __init__ takes 1 positional argument, which is the self of the Chromosome instance being created. Calling Chromosome(tester) passes two arguments, self and tester, but __init__ is only declared to accept self.
Secondly, what you get from the df.apply call in tester is actually a Series indexed like df whose values are Gene objects.
To solve this you would have to change the code along these lines:
class Chromosome(object):
    def __init__(self, genes):
        # genes: an iterable of Gene objects, such as the tester Series
        self.port = list(genes)
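Putting the pieces together, a minimal runnable sketch might look like this. It assumes the column names from the table shown in the question (ID, nb_obj, nb_days) rather than the ones used in the apply call, and it takes nb_days from the dataframe instead of drawing a random value so that the result is reproducible.

```python
import pandas as pd

class Gene:
    def __init__(self, id, nb_trax, nb_days):
        self.id = id
        self.nb_trax = nb_trax
        self.nb_days = nb_days

class Chromosome:
    def __init__(self, genes):
        # Accept any iterable of Gene objects and store it as a plain list.
        self.port = list(genes)

# Sample rows copied from the table in the question.
df = pd.DataFrame({
    "ID": ["ECGYE", "NLRTM"],
    "nb_obj": [10259, 8007],
    "nb_days": [62.965318, 46.550562],
})

# Build one Gene per row, then hand the collection to Chromosome.
genes = df.apply(
    lambda row: Gene(row["ID"], row["nb_obj"], row["nb_days"]), axis=1
)
chromosome = Chromosome(genes)

print([gene.id for gene in chromosome.port])
```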

Propagate pandas series metadata through joins

I'd like to be able to attach metadata to the series of dataframes (specifically, the original filename), so that after joining two dataframes I can see metadata on where each of the series came from.
I see github issues regarding _metadata (here, here), including some relating to the current _metadata attribute (here), but nothing in the pandas docs.
So far I can modify the _metadata attribute to supposedly allow preservation of metadata, but get an AttributeError after the join.
df1 = pd.DataFrame(np.random.randint(0, 4, (6, 3)))
df2 = pd.DataFrame(np.random.randint(0, 4, (6, 3)))
df1._metadata.append('filename')
df1[df1.columns[0]]._metadata.append('filename')
for c in df1:
    df1[c].filename = 'fname1.csv'
    df2[c].filename = 'fname2.csv'
df1[0]._metadata # ['name', 'filename']
df1[0].filename # fname1.csv
df2[0].filename # fname2.csv
df1[0][:3].filename # fname1.csv
mgd = pd.merge(df1, df2, on=[0])
mgd['1_x']._metadata # ['name', 'filename']
mgd['1_x'].filename # raises AttributeError
Any way to preserve this?
Update: Epilogue
As discussed here, __finalize__ cannot keep track of Series that are members of a dataframe, only independent series. So for now I'll keep track of the Series-level metadata by maintaining a dictionary of metadata attached to the dataframes. My code looks like:
def cust_merge(d1, d2):
    "Custom merge function for 2 dicts"
    ...
def finalize_df(self, other, method=None, **kwargs):
    for name in self._metadata:
        if method == 'merge':
            lmeta = getattr(other.left, name, {})
            rmeta = getattr(other.right, name, {})
            newmeta = cust_merge(lmeta, rmeta)
            object.__setattr__(self, name, newmeta)
        else:
            object.__setattr__(self, name, getattr(other, name, None))
    return self
df1.filenames = {c: 'fname1.csv' for c in df1}
df2.filenames = {c: 'fname2.csv' for c in df2}
pd.DataFrame._metadata = ['filenames']
pd.DataFrame.__finalize__ = finalize_df

I think something like this will work (and if not, please file a bug report, as this, while supported, is a bit bleeding edge; in other words, it IS possible that the join methods don't call this all the time. That is a bit untested).
See this issue for a more detailed example/bug fix.
DataFrame._metadata = ['name','filename']
def __finalize__(self, other, method=None, **kwargs):
    """
    propagate metadata from other to self
    Parameters
    ----------
    other : the object from which to get the attributes that we are going
        to propagate
    method : optional, a passed method name ; possibly to take different
        types of propagation actions based on this
    """
    ### you need to arbitrate when there are conflicts
    for name in self._metadata:
        object.__setattr__(self, name, getattr(other, name, None))
    return self
DataFrame.__finalize__ = __finalize__
So this replaces the default finalizer for DataFrame with your custom one. Where I have indicated, you need to put some code which can arbitrate between conflicts. This is the reason this is not done by default: e.g. frame1 has name 'foo' and frame2 has name 'bar', so what do you do when the method is __add__, and what about another method? Let us know what you do and how it works out.
This ONLY replaces the finalizer for DataFrame. You can simply keep the default action, which is to propagate from other to self, or you can set nothing except under special cases of method.
This method is meant to be overridden in sub-classes; that's why you are monkey patching here (rather than sub-classing, which is overkill most of the time).
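For cases where a subclass is acceptable, the same finalizer can be sketched as an override instead of a monkey patch. The class name TrackedFrame and the sample data are hypothetical, and the sketch only exercises slicing; whether merge and join call __finalize__, and what they pass as other, depends on the pandas version, so that part is left out.

```python
import pandas as pd

class TrackedFrame(pd.DataFrame):
    # Names listed here are treated as metadata attributes by pandas.
    _metadata = ["filename"]

    @property
    def _constructor(self):
        # Make pandas operations return TrackedFrame objects.
        return TrackedFrame

    def __finalize__(self, other, method=None, **kwargs):
        # Copy each metadata attribute from `other`, falling back to None
        # when `other` does not carry it. Conflict arbitration for specific
        # values of `method` would go here.
        for name in self._metadata:
            object.__setattr__(self, name, getattr(other, name, None))
        return self

frame = TrackedFrame({"a": [1, 2, 3]})
frame.filename = "fname1.csv"

# Slicing creates a new object, which is then finalized from `frame`.
subset = frame[:2]
print(type(subset).__name__, subset.filename)
```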
