I would like to create a method inside a class that gets a variable and a function as input arguments and return a new value. In below example the arbitrary function can be max, min, mean, or ...:
import pandas as pd
df = pd.DataFrame( {'col1': [1, 2], 'col2': [4, 6]})
df.max(axis=1), df.min(axis=1), df.mean(axis=1) # sample of methods that I would like to pass
I would like to do similar through a method inside a class. My attempt so far that does not work:
class our_class():
def __init__(self, df):
self.df = df
def arb_func(self, func):
return self.df.func()
ob1 = our_class(df)
ob1.arb_func(max(axis=1))
Any suggestions appreciated.
PS: It is a toy problem here. My goal is to be able to get a data frame and do arbitrary number of statistical analysis on it later. I do not want to hardcode the statistical analysis and let it change later if needed.
You could try this:
class our_class():
def __init__(self, df):
self.df = df
def arb_func(self, func):
return func(self.df)
You could then use it like this:
ob1 = our_class(df)
ob1.arb_func(lambda x: x.max(axis=1))
New suggestion
As long as you make sure the function you pass requires a dataframe as its first argument, the problem becomes simple as (as already noted by #jjramsey):
class our_class():
def __init__(self, df):
self.df = df
def arb_func(self, func):
return func(self.df)
Virtually any method of pd.DataFrame, i.e. a method having a self as first input, for instance pd.DataFrame.max source, is directly compatible with this use. In this version you would have to be passing partial functions every time you would need some additional configurations in the form of arguments and keyword arguments. In your case this is the use of axis=1. A little modification to the above implementation can account for such situations:
class our_class():
def __init__(self, df):
self.df = df
def arb_func(self, func, *args, **kwargs):
return func(self.df, *args, **kwargs)
Now this implementation is that generic that you can pass your own functions as well as long as the first parameter is the dataframe. For instance, you would like to count how many apples you have with your own count_apples function as:
def count_apples(df, apples_column_name):
return df[apples_column_name].eq('apple').sum()
Now making use of it as:
df = pd.DataFrame({"fruits_in_store": ["apple", "apple", "pear", "banana", "papaya"]})
ob1.arb_func(count_apples, "fruits_in_store") # it is possible to pass this into the `apples_column_name` as an arg
ob1.arb_func(count_apples, apples_column_name="fruits_in_store") # or you can be explicit
Original answer
I assume the OP is trying to generate some generic coding interface for educational purposes?
Here a suggestion (which in my opinion is actually making the usage way more complex than necessary, as many other users have already noted in their questions/comments):
from functools import partial
import pandas as pd
df = pd.DataFrame({"a": [1, 2, 3]})
class our_class():
def __init__(self, df):
self.df = df
def arb_func(self, func: str, **kwargs):
return partial(getattr(pd.DataFrame, func), **kwargs)(df)
ob1 = our_class(df)
print(ob1.arb_func("max", axis=1))
0 1
1 2
2 3
dtype: int64
print(ob1.arb_func("max", axis=0))
a 3
dtype: int64
Related
I'm sure I'm missing something in how classes work here, but basically this is my class:
import pandas as pd
import numpy as np
import scipy
#example DF with OHLC columns and 100 rows
gold = pd.DataFrame({'Open':[i for i in range(100)],'Close':[i for i in range(100)],'High':[i for i in range(100)],'Low':[i for i in range(100)]})
class Backtest:
def __init__(self, ticker, df):
self.ticker = ticker
self.df = df
self.levels = pivot_points(self.df)
def pivot_points(self,df,period=30):
highs = scipy.signal.argrelmax(df.High.values,order=period)
lows = scipy.signal.argrelmin(df.Low.values,order=period)
return list(df.High[highs[0]]) + list(df.Low[lows[0]])
inst = Backtest('gold',gold) #gold is a Pandas Dataframe with Open High Low Close columns and data
inst.levels # This give me the whole dataframe (inst.df) instead of the expected output of the pivot_point function (a list of integers)
The problem is inst.levels returns the whole DataFrame instead of the return value of the function pivot_points (which is supposed to be a list of integers)
When I called the pivot_points function on the same DataFrame outside this class I got the list I expected
I expected to get the result of the pivot_points() function after assigning it to self.levels inside the init but instead I got the entire DataFrame
You would have to address pivot_points() as self.pivot_points()
And there is no need to add period as an argument if you are not changing it, if you are, its okay there.
I'm not sure if this helps, but here are some tips about your class:
class Backtest:
def __init__(self, ticker, df):
self.ticker = ticker
self.df = df
# no need to define a instance variable here, you can access the method directly
# self.levels = pivot_points(self.df)
def pivot_points(self):
period = 30
# period is a local variable to pivot_points so I can access it directly
print(f'period inside Backtest.pivot_points: {period}')
# df is an instance variable and can be accessed in any method of Backtest after it is instantiated
print(f'self.df inside Backtest.pivot_points(): {self.df}')
# to get any values out of pivot_points we return some calcualtions
return 1 + 1
# if you do need an attribute like level to access it by inst.level you could create a property
#property
def level(self):
return self.pivot_points()
gold = 'some data'
inst = Backtest('gold', gold) # gold is a Pandas Dataframe with Open High Low Close columns and data
print(f'inst.pivot_points() outside the class: {inst.pivot_points()}')
print(f'inst.level outside the class: {inst.level}')
This would be the result:
period inside Backtest.pivot_points: 30
self.df inside Backtest.pivot_points(): some data
inst.pivot_points() outside the class: 2
period inside Backtest.pivot_points: 30
self.df inside Backtest.pivot_points(): some data
inst.level outside the class: 2
Thanks to the commenter Henry Ecker I found that I had the function by the same name defined elsewhere in the file where the output is the df. After changing that my original code is working as expected
Hello users of pandas,
I often find myself printing the shapes of the dataframes after every step of processing. I do this to monitor how the shape of the data changes and to ensure that it is done correctly.
e.g.
print(df.shape)
df=df.dropna()
print(df.shape)
df=df.melt()
print(df.shape)
...
I wonder if there is any better/elegant way, preferably a shorthad or an automatic way to do this kind of stuff.
I believe that what you're doing is entirely fine - especially as you are exploring. The code is easy to read and there isn't too much repetitive code. If you really wanted to reduce lines of code, you could utilize a helper function that could wrap whatever you are trying to run. For example:
def df_caller(df, fn, *args, **kwargs):
new_df = getattr(df, fn)(*args, **kwargs)
print(new_df.shape)
assert df.shape == new_df.shape
return new_df
df = df_caller(df, 'dropna')
df = df_caller(df, 'melt')
...
However, in my opinion the meta programming in the above solution is a little too magical and harder to read than what you originally posted.
I improvised on Matthew Cox's answer, and added an attribute to the pandas dataframe itself. This simplifies things a lot.
import numpy as np
import pandas as pd
# set logger
import logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)
# log changes in dataframe
def log_(df, fun, *args, **kwargs):
logging.info(f'shape changed from {df.shape}', )
df1 = getattr(df, fun)(*args, **kwargs)
logging.info(f'shape changed to {df1.shape}')
return df1
# custom pandas dataframe
#pd.api.extensions.register_dataframe_accessor("log")
class log:
def __init__(self, pandas_obj):
self._obj = pandas_obj
def dropna(self,**kws):
return log_(self._obj,fun='dropna',**kws)
# demo data
df = pd.DataFrame({"name": ['Alfred', 'Batman', 'Catwoman'],
"toy": [np.nan, 'Batmobile', 'Bullwhip'],
"born": [pd.NaT, pd.Timestamp("1940-04-25"),
pd.NaT]})
# trial
df.log.dropna()
# stderr
INFO:root:shape changed from (3, 3)
INFO:root:shape changed to (1, 3)
# returns dropna'd dataframe
This is a general python question. Is it possible to assign different variables to a class object and then perform different set of operations on those variables? I'm trying to reduce code but maybe this isn't how it works. For example, I'm trying to do something like this:
Edit: here is an abstract of the class and methods:
class Class:
def __init__(self, df):
self.df = df
def query(self, query):
self.df = self.df.query(query)
return self
def fill(self, filter):
self.df.update(df.filter(like=filter).mask(lambda x: x == 0).ffill(1))
return self
def diff(self, cols=None, axis=1):
diff = self.df[self.df.columns[~self.df.columns.isin(cols)]].diff(axis=axis)
self.df = diff.join(self.df[self.df.columns.difference(diff.columns)])
return self
def melt(self, cols, var=None, value=None):
return pd.melt(self.df, id_vars=columns, var_name=var, value_name=value)
I'm trying to use it like this:
df = pd.read_csv('data.csv')
df = Class(df)
df = df.query(query).forward_fill(include)
df_1 = df.diff(cols).melt(cols)
df_2 = df.melt(cols)
df_1 and df_2 should have different values, however they are the same as df_1. This issue is resolved if I use the class like this:
df_1 = pd.read_csv('data.csv')
df_2 = pd.read_csv('data.csv')
df_1 = Class(df_1)
df_2 = Class(df_2)
df_1 = df_1.query(query).forward_fill(include)
df_2 = df_2.query(query).forward_fill(include)
df_1 = df_1.diff(cols).melt(cols)
df_2 = df_2.melt(cols)
This results in extra code. Is there a better way to do this where you can use an object differently on different variables, or do I have to create seperate objects if I'm trying to have two variables perform separate operations and return different values?
With the return self statement in the diff- method you return the reference of the object. The same thing happens after the melt method. But in that two methods you allreadey manipulated the origin df.
Here:
1 df = pd.read_csv('data.csv')
2
3 df = Class(df)
4 df = df.query(query).forward_fill(include)
5
6 df_1 = df.diff(cols).melt(cols)
the df has the same values like df_1. I guess the melt method without other args then cols arguments only assigns col names or something like that. Subsequently df_2=df.melt(cols) would have the same result like df_2=df_1.melt(cols).
If you want to work with one object, you dont should use self.df=... in your class methods, because this changes the instance value of df. You only need to write df = ... and than return Class(df).
For example:
def diff(self, cols=None, axis=1):
diff = self.df[self.df.columns[~self.df.columns.isin(cols)]].diff(axis=axis)
df = diff.join(self.df[self.df.columns.difference(diff.columns)])
return Class(df)
Best regards
I'm looking to extend a Panda's DataFrame, creating an object where all of the original DataFrame attributes/methods are in tact, while making a few new attributes/methods available. I also need the ability to convert (or copy) objects that are already DataFrames to my new class. What I have seems to work, but I feel like I might have violated some fundamental convention. Is this the proper way of doing this, or should I even be doing it in the first place?
import pandas as pd
class DataFrame(pd.DataFrame):
def __init__(self, df):
df.__class__ = DataFrame # effectively 'cast' Pandas DataFrame as my own
the idea being I could then initialize it directly from a Pandas DataFrame, e.g.:
df = DataFrame(pd.read_csv(path))
I'd probably do it this way, if I had to:
import pandas as pd
class CustomDataFrame(pd.DataFrame):
#classmethod
def convert_dataframe(cls, df):
df.__class__ = cls
return df
def foo(self):
return "Works"
df = pd.DataFrame([1,2,3])
print(df)
#print(df.foo()) # Will throw, since .foo() is not defined on pd.DataFrame
cdf = CustomDataFrame.convert_dataframe(df)
print(cdf)
print(cdf.foo()) # "Works"
Note: This will forever change the df object you pass to convert_dataframe:
print(type(df)) # <class '__main__.CustomDataFrame'>
print(type(cdf)) # <class '__main__.CustomDataFrame'>
If you don't want this, you could copy the dataframe inside the classmethod.
If you just want to add methods to a DataFrame just monkey patch before you run anything else as below.
>>> import pandas
>>> def foo(self, x):
... return x
...
>>> foo
<function foo at 0x00000000009FCC80>
>>> pandas.DataFrame.foo = foo
>>> bar = pandas.DataFrame()
>>> bar
Empty DataFrame
Columns: []
Index: []
>>> bar.foo(5)
5
>>>
if __name__ == '__main__':
app = DataFrame()
app()
event
super(DataFrame,self).__init__()
First thing first, the title is very unclear, however nothing better sprang to my mind. I'll ellaborate the problem in more detail.
I've found myself doing this routine a lot with pandas dataframes. I need to work for a while with only the part(some columns) of the DataFrame and later I want to add those columns back. The an idea came to my mind = Context Managers. But I am unable to come up with the correct implementation (if there is any..).
import pandas as pd
import numpy as np
class ProtectColumns:
def __init__(self, df, protect_cols=[]):
self.protect_cols = protect_cols
# preserve a copy of the part we want to protect
self.protected_df = df[protect_cols].copy(deep=True)
# create self.df with only the part we want to work on
self.df = df[[x for x in df.columns if x not in protect_cols]]
def __enter__(self):
# return self, or maybe only self.df?
return self
def __exit__(self, *args, **kwargs):
# btw. do i need *args and **kwargs here?
# append the preserved data back to the original, now changed
self.df[self.protect_cols] = self.protected_df
if __name__ == '__main__':
# testing
# create random DataFrame
df = pd.DataFrame(np.random.randn(6,4), columns=list("ABCD"))
# uneccessary step
df = df.applymap(lambda x: int(100 * x))
# show it
print(df)
# work without cols A and B
with ProtectColumns(df, ["A", "B"]) as PC:
# make everything 0
PC.df = PC.df.applymap(lambda x: 0)
# this prints the expected output
print(PC.df)
However, say I don't want to use PC.df onwards, but df. I could just do df = PC.df, or make a copy inside with or after that. But is is possible to handle this inside e.g. the __exit__ method?
# unchanged df
print(df)
with ProtectColumns(df, list("AB")) as PC:
PC.applymap(somefunction)
# df is now changed
print(df)
Thanks for any ideas!