Changing self.variables inside __exit__ method of Context Managers - python

First things first: the title is unclear, but nothing better sprang to mind. I'll elaborate on the problem in more detail.
I find myself repeating this routine a lot with pandas DataFrames: I need to work for a while with only part (some columns) of a DataFrame, and later I want to add those columns back. Then an idea came to mind: context managers. But I am unable to come up with a correct implementation (if there is one).
import pandas as pd
import numpy as np

class ProtectColumns:
    def __init__(self, df, protect_cols=[]):
        self.protect_cols = protect_cols
        # preserve a copy of the part we want to protect
        self.protected_df = df[protect_cols].copy(deep=True)
        # create self.df with only the part we want to work on
        self.df = df[[x for x in df.columns if x not in protect_cols]]

    def __enter__(self):
        # return self, or maybe only self.df?
        return self

    def __exit__(self, *args, **kwargs):
        # btw. do i need *args and **kwargs here?
        # append the preserved data back to the original, now changed
        self.df[self.protect_cols] = self.protected_df
if __name__ == '__main__':
    # testing
    # create a random DataFrame
    df = pd.DataFrame(np.random.randn(6, 4), columns=list("ABCD"))
    # unnecessary step
    df = df.applymap(lambda x: int(100 * x))
    # show it
    print(df)
    # work without cols A and B
    with ProtectColumns(df, ["A", "B"]) as PC:
        # make everything 0
        PC.df = PC.df.applymap(lambda x: 0)
    # this prints the expected output
    print(PC.df)
However, say I don't want to use PC.df afterwards, but df. I could just do df = PC.df, or make a copy inside the with block or after it. But is it possible to handle this inside e.g. the __exit__ method?
# unchanged df
print(df)
with ProtectColumns(df, list("AB")) as PC:
    PC.applymap(somefunction)
# df is now changed
print(df)
Thanks for any ideas!
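One possible approach (a sketch of my own, not from the thread, assuming the goal is for the with block to update df in place): keep a reference to the original DataFrame and write the working columns back into it inside __exit__. As an aside, __exit__ actually receives three positional arguments (exc_type, exc_value, traceback), so *args works but spelling them out is more explicit.

import pandas as pd
import numpy as np

class ProtectColumnsInPlace:
    def __init__(self, df, protect_cols=None):
        self.original = df  # keep a reference, not a copy
        self.protect_cols = protect_cols or []
        self.work_cols = [c for c in df.columns if c not in self.protect_cols]
        self.df = df[self.work_cols].copy()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # write the (possibly replaced) working part back into the original df;
        # the protected columns were never touched, so they keep their values
        self.original[self.work_cols] = self.df[self.work_cols]
        return False  # do not swallow exceptions

if __name__ == '__main__':
    df = pd.DataFrame(np.random.randn(6, 4), columns=list("ABCD"))
    with ProtectColumnsInPlace(df, ["A", "B"]) as PC:
        PC.df = PC.df.applymap(lambda x: 0)
    print(df)  # C and D are now 0; A and B are unchanged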

Related

__init__() got multiple values for argument 'use_technical_indicator' - error

I can't figure out why I am getting this error; if you can, I'd appreciate specific instructions. This code is in one module; there are 7 modules total.
Python 3.7, Mac OS, code from www.finrl.org
# Perform Feature Engineering:
df = FeatureEngineer(df.copy(),
                     use_technical_indicator=True,
                     use_turbulence=False).preprocess_data()

# add covariance matrix as states
df = df.sort_values(['date', 'tic'], ignore_index=True)
df.index = df.date.factorize()[0]

cov_list = []
# look-back is one year
lookback = 252
for i in range(lookback, len(df.index.unique())):
    data_lookback = df.loc[i-lookback:i, :]
    price_lookback = data_lookback.pivot_table(index='date', columns='tic', values='close')
    return_lookback = price_lookback.pct_change().dropna()
    covs = return_lookback.cov().values
    cov_list.append(covs)

df_cov = pd.DataFrame({'date': df.date.unique()[lookback:], 'cov_list': cov_list})
df = df.merge(df_cov, on='date')
df = df.sort_values(['date', 'tic']).reset_index(drop=True)
df.head()
The function definition statement for FeatureEngineer.__init__ is:
def __init__(
    self,
    use_technical_indicator=True,
    tech_indicator_list=config.TECHNICAL_INDICATORS_LIST,
    use_turbulence=False,
    user_defined_feature=False,
):
As you can see, there is no parameter (other than self, which you should not pass) before use_technical_indicator, so you should remove the df.copy() from the FeatureEngineer(...) call.
Checking the current FeatureEngineer class, you must provide df.copy() to the preprocess_data() method instead.
So your code has to look like:
# Perform Feature Engineering:
df = FeatureEngineer(use_technical_indicator=True,
                     tech_indicator_list=config.TECHNICAL_INDICATORS_LIST,
                     use_turbulence=True,
                     user_defined_feature=False).preprocess_data(df.copy())

Alternative to repeatedly printing shapes of the pandas dataframe after every step

Hello users of pandas,
I often find myself printing the shapes of my dataframes after every processing step. I do this to monitor how the shape of the data changes and to ensure the processing is done correctly.
E.g.:

print(df.shape)
df = df.dropna()
print(df.shape)
df = df.melt()
print(df.shape)
...

I wonder if there is a better or more elegant way to do this, preferably a shorthand or something automatic.
I believe that what you're doing is entirely fine - especially as you are exploring. The code is easy to read and there isn't too much repetitive code. If you really wanted to reduce lines of code, you could utilize a helper function that could wrap whatever you are trying to run. For example:
def df_caller(df, fn, *args, **kwargs):
    new_df = getattr(df, fn)(*args, **kwargs)
    print(new_df.shape)
    # a sanity check on new_df.shape could go here if desired
    return new_df

df = df_caller(df, 'dropna')
df = df_caller(df, 'melt')
...
However, in my opinion the meta programming in the above solution is a little too magical and harder to read than what you originally posted.
I improvised on Matthew Cox's answer, and added an attribute to the pandas dataframe itself. This simplifies things a lot.
import numpy as np
import pandas as pd

# set up the logger
import logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)

# log changes in the dataframe's shape
def log_(df, fun, *args, **kwargs):
    logging.info(f'shape changed from {df.shape}')
    df1 = getattr(df, fun)(*args, **kwargs)
    logging.info(f'shape changed to {df1.shape}')
    return df1

# custom accessor on the pandas dataframe itself, available as df.log
@pd.api.extensions.register_dataframe_accessor("log")
class log:
    def __init__(self, pandas_obj):
        self._obj = pandas_obj

    def dropna(self, **kws):
        return log_(self._obj, fun='dropna', **kws)
# demo data
df = pd.DataFrame({"name": ['Alfred', 'Batman', 'Catwoman'],
                   "toy": [np.nan, 'Batmobile', 'Bullwhip'],
                   "born": [pd.NaT, pd.Timestamp("1940-04-25"), pd.NaT]})

# trial
df.log.dropna()

# stderr:
INFO:root:shape changed from (3, 3)
INFO:root:shape changed to (1, 3)

# returns the dropna'd dataframe
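If writing one wrapper per method gets tedious, the same accessor idea could forward arbitrary method names; a minimal sketch of my own (not from the answer above), assuming every forwarded method returns a DataFrame whose shape is worth logging:

import functools
import logging
import pandas as pd

@pd.api.extensions.register_dataframe_accessor("logged")
class LoggedAccessor:
    def __init__(self, pandas_obj):
        self._obj = pandas_obj

    def __getattr__(self, name):
        # look the method up on the wrapped dataframe and wrap it so that
        # the shapes before and after the call are logged
        method = getattr(self._obj, name)

        @functools.wraps(method)
        def wrapper(*args, **kwargs):
            result = method(*args, **kwargs)
            logging.info(f'{name}: shape {self._obj.shape} -> {result.shape}')
            return result

        return wrapper

# usage: df.logged.dropna(), df.logged.melt(), ...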

Python assign different variables to a class object

This is a general python question. Is it possible to assign different variables to a class object and then perform different sets of operations on those variables? I'm trying to reduce code, but maybe this isn't how it works. For example, I'm trying to do something like this:
Edit: here is an abstract of the class and methods:
class Class:
    def __init__(self, df):
        self.df = df

    def query(self, query):
        self.df = self.df.query(query)
        return self

    def forward_fill(self, filter):
        self.df.update(self.df.filter(like=filter).mask(lambda x: x == 0).ffill(axis=1))
        return self

    def diff(self, cols=None, axis=1):
        diff = self.df[self.df.columns[~self.df.columns.isin(cols)]].diff(axis=axis)
        self.df = diff.join(self.df[self.df.columns.difference(diff.columns)])
        return self

    def melt(self, cols, var=None, value=None):
        return pd.melt(self.df, id_vars=cols, var_name=var, value_name=value)
I'm trying to use it like this:
df = pd.read_csv('data.csv')
df = Class(df)
df = df.query(query).forward_fill(include)
df_1 = df.diff(cols).melt(cols)
df_2 = df.melt(cols)
df_1 and df_2 should have different values; however, df_2 comes out the same as df_1. The issue is resolved if I use the class like this:
df_1 = pd.read_csv('data.csv')
df_2 = pd.read_csv('data.csv')
df_1 = Class(df_1)
df_2 = Class(df_2)
df_1 = df_1.query(query).forward_fill(include)
df_2 = df_2.query(query).forward_fill(include)
df_1 = df_1.diff(cols).melt(cols)
df_2 = df_2.melt(cols)
This results in extra code. Is there a better way to use one object with different variables, or do I have to create separate objects if I want two variables to undergo separate operations and return different values?
With the return self statement in the diff method you return a reference to the object itself. The same thing happens with the melt method. But by the time those methods run, you have already mutated the original df.
Here:
df = pd.read_csv('data.csv')

df = Class(df)
df = df.query(query).forward_fill(include)

df_1 = df.diff(cols).melt(cols)
df has the same values as df_1. I guess the melt method, with no arguments other than cols, only assigns column names or something like that. Consequently, df_2 = df.melt(cols) gives the same result as df_2 = df_1.melt(cols).
If you want to work with one object, you should not use self.df = ... in your class methods, because that changes the instance's df. You only need to write df = ... and then return Class(df).
For example:
def diff(self, cols=None, axis=1):
    diff = self.df[self.df.columns[~self.df.columns.isin(cols)]].diff(axis=axis)
    df = diff.join(self.df[self.df.columns.difference(diff.columns)])
    return Class(df)
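To make the difference concrete, here is a self-contained toy version of that style (my own sketch; the methods zeroed and doubled are hypothetical stand-ins, not from the question). Because every method returns a new instance and never rebinds self.df, branching from a shared intermediate gives independent results:

import pandas as pd

class Frame:
    def __init__(self, df):
        self.df = df

    def zeroed(self, cols):
        df = self.df.copy()
        df[cols] = 0
        return Frame(df)  # new instance; self.df is untouched

    def doubled(self):
        return Frame(self.df * 2)

base = Frame(pd.DataFrame({'A': [1, 2], 'B': [3, 4]}))
f1 = base.zeroed(['A']).doubled()
f2 = base.doubled()
print(base.df)  # unchanged: A = [1, 2]
print(f1.df)    # A = [0, 0], B = [6, 8]
print(f2.df)    # A = [2, 4], B = [6, 8]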
Best regards

Adding a column to dask dataframe, computing it through a rolling window

Suppose I have the following code, to generate a dummy dask dataframe:
import pandas as pd
import dask.dataframe as dd
pandas_dataframe = pd.DataFrame({'A': [0, 500, 1000], 'B': [-100, 200, 300], 'C': [0, 0, 1.0]})
test_data_frame = dd.from_pandas(pandas_dataframe, npartitions=1)
Ideally I would like to know the recommended way to add another column to the data frame, computing its content through a rolling window, in a lazy fashion.
I came up with the following approach:
import numpy as np
from dask import delayed as delay

@delay
def coupled_operation_example(dask_dataframe,
                              list_of_input_lbls,
                              fcn,
                              window_size,
                              init_value,
                              output_lbl):
    def preallocate_channel_data(vector_length, first_components):
        vector_out = np.zeros(len(dask_dataframe))
        vector_out[0:len(first_components)] = first_components
        return vector_out

    def create_output_signal(relevant_data, fcn, window_size, initiated_vec):
        ## to be written; fcn would be a fcn accepting the sliding window
        pass

    initiated_vec = preallocate_channel_data(len(dask_dataframe), init_value)
    relevant_data = dask_dataframe[list_of_input_lbls]
    my_output_signal = create_output_signal(relevant_data, fcn, window_size, initiated_vec)
I was writing this convinced that dask dataframes would allow me some slicing: they do not. So my first option would be to extract the columns involved in the computation as numpy arrays, but then they would be eagerly evaluated. I think the performance penalty would be significant. At the moment I create dask dataframes from h5 data using h5py, so everything stays lazy until I write the output files.
Up to now I was processing only data on a certain row; so I had been using:
test_data_frame.apply(fcn, axis=1, meta=float)
I do not think there is an equivalent functional approach for rolling windows; am I right? I would like something like Seq.windowed in F# or Haskell. Any suggestion would be highly appreciated.
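For what it's worth, dask dataframes mirror a subset of pandas' rolling API, which may already cover window-only computations; a minimal sketch (my addition, not from the original post), assuming the per-window function only needs the values inside the window and not external state:

import numpy as np
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({'A': np.arange(10, dtype=float)})
ddf = dd.from_pandas(pdf, npartitions=2)

# lazily derive a new column from a width-3 rolling window over column A;
# nothing is evaluated until .compute() is called
spread = ddf['A'].rolling(window=3).apply(lambda w: w.max() - w.min(), raw=True)
ddf = ddf.assign(spread=spread)
print(ddf.compute())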
I have tried to solve it with a closure. I will post benchmarks on some data as soon as I have finalized the code. For now, here is a toy example which seems to work, since dask dataframe's apply method seems to preserve the row order.
import numpy as np
import pandas as pd
import dask.dataframe as dd

number_of_components = 30
df = pd.DataFrame(np.random.randint(0, number_of_components, size=(number_of_components, 2)), columns=list('AB'))
my_data_frame = dd.from_pandas(df, npartitions=1)

def sumPrevious(previousState):
    def getValue(row):
        nonlocal previousState
        something = row['A'] - previousState
        previousState = row['A']
        return something
    return getValue

given_func = sumPrevious(1)
out = my_data_frame.apply(given_func, axis=1, meta=float)
df['computed'] = out.compute()
Now the bad news: I have tried to abstract it out, passing the state around and using a rolling window of any width, through this new function:
def generalised_coupled_computation(previous_state, coupled_computation, previous_state_update):
    def inner_function(actual_state):
        nonlocal previous_state
        actual_value = coupled_computation(actual_state, previous_state)
        previous_state = previous_state_update(actual_state, previous_state)
        return actual_value
    return inner_function
Suppose we initialize the function with:
init_state = df.loc[0]
coupled_computation = lambda act, prev: act['A'] - prev['A']
new_update = lambda act, prev: act
given_func3 = generalised_coupled_computation(init_state, coupled_computation, new_update)
out3 = my_data_frame.apply(given_func3, axis=1, meta=float)
Try to run it and be ready for surprises: the first element is wrong. Given the odd result, it is possibly a shared-reference problem. Any insight?
Anyhow, if one passes primitive types, it seems to function.
Update:
the solution is to use copy:

import copy

def new_update(act, previous):
    return copy.copy(act)

Now the function behaves as expected; of course it is necessary to adapt the state update and the coupled computation function if one needs more coupled logic.
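The likely mechanism (my reading; the original post doesn't spell it out): the un-copied new_update stored a reference to a live row object that gets mutated between calls, so previous_state silently changed under the closure; copy.copy snapshots the values at update time. The same aliasing hazard in miniature, with a plain dict:

import copy

state = {'A': 1}
holder = state               # a reference: it will see later mutations
snapshot = copy.copy(state)  # a shallow copy: frozen at this point

state['A'] = 99
print(holder['A'])    # 99 - the reference saw the mutation
print(snapshot['A'])  # 1  - the copy did not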

Proper way to extend Python class

I'm looking to extend a pandas DataFrame, creating an object where all of the original DataFrame attributes/methods are intact, while making a few new attributes/methods available. I also need the ability to convert (or copy) objects that are already DataFrames to my new class. What I have seems to work, but I feel like I might have violated some fundamental convention. Is this the proper way of doing this, or should I even be doing it in the first place?
import pandas as pd

class DataFrame(pd.DataFrame):
    def __init__(self, df):
        df.__class__ = DataFrame  # effectively 'cast' the pandas DataFrame as my own
the idea being I could then initialize it directly from a Pandas DataFrame, e.g.:
df = DataFrame(pd.read_csv(path))
I'd probably do it this way, if I had to:
import pandas as pd

class CustomDataFrame(pd.DataFrame):
    @classmethod
    def convert_dataframe(cls, df):
        df.__class__ = cls
        return df

    def foo(self):
        return "Works"

df = pd.DataFrame([1, 2, 3])
print(df)
# print(df.foo())  # Would throw, since .foo() is not defined on pd.DataFrame

cdf = CustomDataFrame.convert_dataframe(df)
print(cdf)
print(cdf.foo())  # "Works"
Note: This will forever change the df object you pass to convert_dataframe:

print(type(df))   # <class '__main__.CustomDataFrame'>
print(type(cdf))  # <class '__main__.CustomDataFrame'>
If you don't want this, you could copy the dataframe inside the classmethod.
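A sketch of that copying variant (my own, following the note above): copy first, so the caller's object keeps its original class:

import pandas as pd

class CustomDataFrameCopying(pd.DataFrame):
    @classmethod
    def convert_dataframe(cls, df):
        new_df = df.copy()      # leave the caller's df untouched
        new_df.__class__ = cls  # 'cast' only the copy
        return new_df

    def foo(self):
        return "Works"

df = pd.DataFrame([1, 2, 3])
cdf = CustomDataFrameCopying.convert_dataframe(df)
print(type(df))   # <class 'pandas.core.frame.DataFrame'> - unchanged
print(type(cdf))  # <class '__main__.CustomDataFrameCopying'>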
If you just want to add methods to a DataFrame, you can monkey-patch it before you run anything else, as below.
>>> import pandas
>>> def foo(self, x):
... return x
...
>>> foo
<function foo at 0x00000000009FCC80>
>>> pandas.DataFrame.foo = foo
>>> bar = pandas.DataFrame()
>>> bar
Empty DataFrame
Columns: []
Index: []
>>> bar.foo(5)
5
>>>
