Hello users of pandas,
I often find myself printing the shapes of my dataframes after every processing step. I do this to monitor how the shape of the data changes and to make sure each step did what I expected.
e.g.
print(df.shape)
df=df.dropna()
print(df.shape)
df=df.melt()
print(df.shape)
...
I wonder if there is any better or more elegant way to do this, ideally a shorthand or something automatic.
I believe that what you're doing is entirely fine - especially as you are exploring. The code is easy to read and there isn't too much repetitive code. If you really wanted to reduce lines of code, you could utilize a helper function that could wrap whatever you are trying to run. For example:
def df_caller(df, fn, *args, **kwargs):
    new_df = getattr(df, fn)(*args, **kwargs)
    print(new_df.shape)
    # you could also assert something here about the expected shape
    return new_df

df = df_caller(df, 'dropna')
df = df_caller(df, 'melt')
...
However, in my opinion the meta programming in the above solution is a little too magical and harder to read than what you originally posted.
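If the getattr indirection feels too heavy, one lighter option (a sketch of my own, not part of the answer above) is to lean on DataFrame.pipe with a plain logging helper, so each step stays an ordinary method call:

import pandas as pd

def log_shape(df, label=""):
    print(label, df.shape)
    return df

df = pd.DataFrame({"a": [1, None, 3], "b": [4, 5, 6]})
df = (df.pipe(log_shape, "start:")
        .dropna()
        .pipe(log_shape, "after dropna:")
        .melt()
        .pipe(log_shape, "after melt:"))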
I improvised on Matthew Cox's answer and added an accessor attribute to the pandas DataFrame itself. This simplifies things a lot.
import logging

import numpy as np
import pandas as pd

# set logger
logger = logging.getLogger()
logger.setLevel(logging.INFO)

# log changes in dataframe shape
def log_(df, fun, *args, **kwargs):
    logging.info(f'shape changed from {df.shape}')
    df1 = getattr(df, fun)(*args, **kwargs)
    logging.info(f'shape changed to {df1.shape}')
    return df1

# custom pandas dataframe accessor
@pd.api.extensions.register_dataframe_accessor("log")
class log:
    def __init__(self, pandas_obj):
        self._obj = pandas_obj

    def dropna(self, **kws):
        return log_(self._obj, fun='dropna', **kws)

# demo data
df = pd.DataFrame({"name": ['Alfred', 'Batman', 'Catwoman'],
                   "toy": [np.nan, 'Batmobile', 'Bullwhip'],
                   "born": [pd.NaT, pd.Timestamp("1940-04-25"),
                            pd.NaT]})

# trial
df.log.dropna()
# stderr
INFO:root:shape changed from (3, 3)
INFO:root:shape changed to (1, 3)
# returns dropna'd dataframe
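A possible generalisation (my own sketch; the accessor name "logged" and the __getattr__ forwarding are assumptions, not part of the answer above) is to forward any DataFrame method through the same logging wrapper instead of writing one passthrough per method:

@pd.api.extensions.register_dataframe_accessor("logged")
class Logged:
    def __init__(self, pandas_obj):
        self._obj = pandas_obj

    def __getattr__(self, name):
        # called for any attribute not defined on the accessor itself
        def wrapper(*args, **kwargs):
            return log_(self._obj, name, *args, **kwargs)
        return wrapper

df.logged.dropna()   # logs the shapes, then returns the dropna'd dataframe
df.logged.melt()     # works for any DataFrame method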
Related
What is the best way to go about testing a pandas dataframe processing chain? I stubbed out the script file and the test file below so you can see what I mean.
I am getting confused about best practice. My guiding intuitions are to make the tests able to run in any order, to limit how many times the csv is loaded from disk, and to make sure each point in the chain does not modify the fixture. Each step in the process depends on the previous steps, so unit testing each node amounts to testing the accumulated processing up to that point in the pipeline. So far I am accomplishing the mission, but there is a lot of code duplication because I incrementally rebuild the pipeline in each test.
What is the way to test this kind of python script?
This is the data processing file stubbed out:
# main_script.py

def calc_allocation_methodology(df_row):
    print('calculating allocation methodology')
    return 'simple'

def flag_data_for_the_allocation_methodology(df):
    allocation_methodology = df.apply(calc_allocation_methodology, axis=1)
    df = df.assign(allocation_methodology=allocation_methodology)
    print('flagging each row for the allocation methodology')
    return df

def convert_repeating_values_to_nan(df):
    'keep one value and nan the rest of the values'
    print('convert repeating values to nan')
    return df

def melt_and_drop_accounting_columns(df):
    print('melt and drop accounting columns')
    print(f'columns remaining: {df.shape[1]}')
    return df

def melt_and_drop_engineering_columns(df):
    print('melt and drop engineering columns')
    print(f'columns remaining: {df.shape[1]}')
    return df

def process_csv_to_tiny_format(df):
    print('process the entire pipeline')
    return (df
            .pipe(flag_data_for_the_allocation_methodology)
            .pipe(convert_repeating_values_to_nan)
            .pipe(melt_and_drop_accounting_columns)
            .pipe(melt_and_drop_engineering_columns)
            )
This is the test file stubbed out:
# test_main.py
import pandas as pd
from pytest import fixture

import main_script as main

@fixture(scope='session')
def df_from_csv():
    return pd.read_csv('database_dump.csv')

@fixture
def df_copy(df_from_csv):
    return df_from_csv.copy()

def test_expected_flag_data_for_the_allocation_methodology(df_copy):
    node_to_test = df_copy.pipe(main.flag_data_for_the_allocation_methodology)
    assert True

def test_convert_repeating_values_to_nan(df_copy):
    node_to_test = (df_copy
                    .pipe(main.flag_data_for_the_allocation_methodology)
                    .pipe(main.convert_repeating_values_to_nan))
    assert True

def test_melt_and_drop_accounting_columns(df_copy):
    node_to_test = (df_copy
                    .pipe(main.flag_data_for_the_allocation_methodology)
                    .pipe(main.convert_repeating_values_to_nan)
                    .pipe(main.melt_and_drop_accounting_columns))
    assert True

def test_melt_and_drop_engineering_columns(df_copy):
    node_to_test = (df_copy
                    .pipe(main.flag_data_for_the_allocation_methodology)
                    .pipe(main.convert_repeating_values_to_nan)
                    .pipe(main.melt_and_drop_accounting_columns)
                    .pipe(main.melt_and_drop_engineering_columns))
    assert True

def test_process_csv_to_tiny_format(df_from_csv):
    df = df_from_csv.copy()
    tiny_data = main.process_csv_to_tiny_format(df)
    assert True
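One hedged idea for cutting the duplication described in the question (a sketch of my own, not part of the original post; PIPELINE and run_until are made-up names): keep the ordered steps in one place and replay them up to the stage under test.

PIPELINE = [
    main.flag_data_for_the_allocation_methodology,
    main.convert_repeating_values_to_nan,
    main.melt_and_drop_accounting_columns,
    main.melt_and_drop_engineering_columns,
]

def run_until(df, last_step):
    """Apply pipeline steps 0..last_step (inclusive)."""
    for step in PIPELINE[:last_step + 1]:
        df = df.pipe(step)
    return df

def test_melt_and_drop_accounting_columns(df_copy):
    node_to_test = run_until(df_copy, 2)
    assert True  # replace with real expectations about node_to_test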
I am sorry, I am aware the title is somewhat fuzzy.
Context
I am using a DataFrame to keep track of files, because pandas DataFrame offers several relevant functions to do all kinds of filtering a dict cannot do, with loc, pd.IndexSlice, .index, .columns, pd.MultiIndex...
Ok, so this may not look like the best choice to expert developers (which I am not), but all these functions have been so handy that I have come to use a DataFrame for this.
And as the cherry on the cake, the __repr__ of a MultiIndex DataFrame is just perfect when I want to know what is inside my file list.
Quick introduction to Summary class, inheriting from DataFrame
Because my DataFrame, which I call 'Summary', has some specific functions, I would like to make it a class inheriting from the pandas DataFrame class.
It also has 'fixed' MultiIndexes, for both rows and columns.
Finally, because my Summary class is defined outside the Store class which is actually managing file organization, Summary class needs a function from Store to be able to retrieve file organization.
Questions
The trouble with pd.DataFrame is that (AFAIK) you cannot append rows without creating a new DataFrame.
Summary has a refresh function so that it can recreate itself by reading the folder content, so a refresh somehow 'resets' the Summary object.
To manage the Summary refresh, I have come up with a first version (not working) and finally a second one (working).
import pandas as pd
import numpy as np

# Dummy function
def summa(a, b):
    return a + b

# Does not work
class DatF1(pd.DataFrame):
    def __init__(self, meth, data=None):
        cmidx = pd.MultiIndex.from_arrays([['Index', 'Index'], ['First', 'Last']])
        rmidx = pd.MultiIndex(levels=[[], []], codes=[[], []],
                              names=['Component', 'Interval'])
        super().__init__(data=data, index=rmidx, columns=cmidx, dtype=np.datetime64)
        self.meth = meth

    def refresh(self):
        values = [[pd.Timestamp('2020/02/10 8:00'), pd.Timestamp('2020/02/10 8:00')],
                  [pd.Timestamp('2020/02/11 8:00'), pd.Timestamp('2020/02/12 8:00')]]
        rmidx = pd.MultiIndex.from_arrays([['Comp1', 'Comp1'], ['1h', '1W']],
                                          names=['Component', 'Interval'])
        self = pd.DataFrame(values, index=rmidx, columns=self.columns)

ex1 = DatF1(summa)

In [10]: ex1.meth(3, 4)
Out[10]: 7

ex1.refresh()

In [11]: ex1
Out[11]: Empty DatF1
Columns: [(Index, First), (Index, Last)]
Index: []
After refresh(), ex1 is still empty. refresh has not worked correctly.
# Works
class DatF2(pd.DataFrame):
    def __init__(self, meth, data=None):
        cmidx = pd.MultiIndex.from_arrays([['Index', 'Index'], ['First', 'Last']])
        rmidx = pd.MultiIndex(levels=[[], []], codes=[[], []],
                              names=['Component', 'Interval'])
        super().__init__(data=data, index=rmidx, columns=cmidx, dtype=np.datetime64)
        self.meth = meth

    def refresh(self):
        values = [[pd.Timestamp('2020/02/10 8:00'), pd.Timestamp('2020/02/10 8:00')],
                  [pd.Timestamp('2020/02/11 8:00'), pd.Timestamp('2020/02/12 8:00')]]
        rmidx = pd.MultiIndex.from_arrays([['Comp1', 'Comp1'], ['1h', '1W']],
                                          names=['Component', 'Interval'])
        super().__init__(values, index=rmidx, columns=self.columns)

ex2 = DatF2(summa)

In [10]: ex2.meth(3, 4)
Out[10]: 7

ex2.refresh()

In [11]: ex2
Out[11]:                         Index
                                 First                Last
Component Interval
Comp1     1h       2020-02-10 08:00:00 2020-02-10 08:00:00
          1W       2020-02-11 08:00:00 2020-02-12 08:00:00
This code works!
I have 2 questions:
why is the 1st code not working? (I am sorry, this is maybe obvious, but I have no idea why it does not work)
is calling super().__init__ in my refresh method acceptable coding practise? (or rephrased differently: is it acceptable to call super().__init__ in places other than the __init__ of my subclass?)
Thanks a lot for your help and advice. The world of class inheritance is quite new to me, and the fact that DataFrame content cannot, so to say, be modified directly seems to make it one step more difficult to handle.
Have a good day,
Bests,
Error message when adding a new row
import pandas as pd
import numpy as np

# Dummy function
def new_rows():
    return [['Comp1', 'Comp1'], ['1h', '1W']]

# Does not work
class DatF1(pd.DataFrame):
    def __init__(self, meth, data=None):
        cmidx = pd.MultiIndex.from_arrays([['Index', 'Index'], ['First', 'Last']])
        rmidx = pd.MultiIndex(levels=[[], []], codes=[[], []],
                              names=['Component', 'Interval'])
        super().__init__(data=data, index=rmidx, columns=cmidx, dtype=np.datetime64)
        self.meth = meth

    def refresh(self):
        values = [[pd.Timestamp('2020/02/10 8:00'), pd.Timestamp('2020/02/10 8:00')],
                  [pd.Timestamp('2020/02/11 8:00'), pd.Timestamp('2020/02/12 8:00')]]
        rmidx = self.meth()
        self[rmidx] = values

ex1 = DatF1(new_rows)
ex1.refresh()

KeyError: "None of [MultiIndex([('Comp1', 'Comp1'),
            (   '1h',    '1W')],
           names=['Component', 'Interval'])] are in the [index]"
Answers to your questions
why the 1st code is not working?
In DatF1.refresh you assign a brand-new DataFrame to self. Inside a method, self is only a local name, so that assignment merely rebinds the local variable; the object you called refresh() on is never modified, which is why ex1 is still empty afterwards.
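A tiny illustration of the rebinding, outside pandas (my own sketch):

class Box(list):
    def refresh(self):
        self = [1, 2, 3]   # rebinds the local name only; the Box instance is untouched

b = Box()
b.refresh()
print(b)   # [] -- still empty, just like ex1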
is calling super().__init__ in my refresh method acceptable coding practise?
Maybe a legitimate use case exists for calling super().__init__ outside the __init__() method, but your case is not one of them. You have already inherited everything in your __init__(). Why use it again?
A better solution
The solution to your problem is unexpectedly simple, because you can append a row to a DataFrame in place with .loc:
df.loc['new_row'] = [value_1, value_2, ...]
Or in your case with a MultiIndex (see this SO post):
df.loc[('Comp1', '1h'), :] = [pd.Timestamp('2020/02/10 8:00'), pd.Timestamp('2020/02/10 8:00')]
Best practice
You should not inherit from pd.DataFrame. If you want to extend pandas use the documented API.
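For completeness, a minimal sketch of what the documented route can look like (my assumption of the intent; the accessor name "summary" is made up): a registered accessor keeps the plain DataFrame and hangs the extra behaviour off a namespace, and refresh returns a rebuilt frame rather than trying to mutate the existing one in place.

import pandas as pd

@pd.api.extensions.register_dataframe_accessor("summary")
class SummaryAccessor:
    def __init__(self, pandas_obj):
        self._obj = pandas_obj

    def refresh(self):
        # return a rebuilt frame instead of mutating the existing one
        values = [[pd.Timestamp('2020/02/10 8:00'), pd.Timestamp('2020/02/10 8:00')]]
        rmidx = pd.MultiIndex.from_arrays([['Comp1'], ['1h']],
                                          names=['Component', 'Interval'])
        return pd.DataFrame(values, index=rmidx, columns=self._obj.columns)

cmidx = pd.MultiIndex.from_arrays([['Index', 'Index'], ['First', 'Last']])
base = pd.DataFrame(columns=cmidx)
refreshed = base.summary.refresh()
print(refreshed)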
Suppose I have the following code, to generate a dummy dask dataframe:
import pandas as pd
import dask.dataframe as dd
pandas_dataframe = pd.DataFrame({'A': [0, 500, 1000], 'B': [-100, 200, 300], 'C': [0, 0, 1.0]})
test_data_frame = dd.from_pandas(pandas_dataframe, npartitions=1)
Ideally I would like to know what is the recommended way to add another column to the data frame, computing the column content through a rolling window, in a lazy fashion.
I came up with the following approach:
import numpy as np
from dask import delayed

@delayed
def coupled_operation_example(dask_dataframe,
                              list_of_input_lbls,
                              fcn,
                              window_size,
                              init_value,
                              output_lbl):

    def preallocate_channel_data(vector_length, first_components):
        vector_out = np.zeros(vector_length)
        vector_out[0:len(first_components)] = first_components
        return vector_out

    def create_output_signal(relevant_data, fcn, window_size, initiated_vec):
        ## to be written; fcn would be a fcn accepting the sliding window
        pass

    initiated_vec = preallocate_channel_data(len(dask_dataframe), init_value)
    relevant_data = dask_dataframe[list_of_input_lbls]
    my_output_signal = create_output_signal(relevant_data, fcn, window_size, initiated_vec)
I wrote this convinced that dask dataframes would allow me this kind of slicing: they do not. So my first option would be to extract the columns involved in the computations as numpy arrays, but then they would be eagerly evaluated, and I think the performance penalty would be significant. At the moment I create dask dataframes from h5 data, using h5py, so everything stays lazy until I write output files.
Up to now I was processing data only row by row, so I had been using:
test_data_frame.apply(fcn, axis=1, meta=float)
I do not think there is an equivalent functional approach for rolling windows; am I right? I would like something like Seq.windowed in F# or Haskell. Any suggestion highly appreciated.
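(A hedged pointer, not part of the original post: dask.dataframe does expose a pandas-like .rolling() that stays lazy, which may cover simple windowed computations, although not the stateful, coupled case described below.)

rolled = test_data_frame['A'].rolling(window=2).mean()   # still lazy
print(rolled.compute())                                  # evaluated only here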
I have tried to solve it with a closure. I will post benchmarks on some data as soon as I have finalized the code. For now I have the following toy example, which seems to work, since the dask dataframe's apply method seems to preserve the row order.
import numpy as np
import pandas as pd
import dask.dataframe as dd

number_of_components = 30

df = pd.DataFrame(np.random.randint(0, number_of_components, size=(number_of_components, 2)),
                  columns=list('AB'))
my_data_frame = dd.from_pandas(df, npartitions=1)

def sumPrevious(previousState):
    def getValue(row):
        nonlocal previousState
        something = row['A'] - previousState
        previousState = row['A']
        return something
    return getValue

given_func = sumPrevious(1)
out = my_data_frame.apply(given_func, axis=1, meta=float)
df['computed'] = out.compute()
Now the bad news: I have tried to abstract it out, passing the state around and allowing a rolling window of any width, through this new function:
def generalised_coupled_computation(previous_state, coupled_computation, previous_state_update):
    def inner_function(actual_state):
        nonlocal previous_state
        actual_value = coupled_computation(actual_state, previous_state)
        previous_state = previous_state_update(actual_state, previous_state)
        return actual_value
    return inner_function
Suppose we initialize the function with:
init_state = df.loc[0]
coupled_computation = lambda act, prev: act['A'] - prev['A']
new_update = lambda act, prev: act

given_func3 = generalised_coupled_computation(init_state, coupled_computation, new_update)
out3 = my_data_frame.apply(given_func3, axis=1, meta=float)
Try to run it and be ready for surprises: the first element is wrong, probably because of a reference/aliasing problem, given the odd result. Any insight?
Anyhow, if one passes primitive types, it seems to function.
Update:
the solution is to use copy:
import copy

def new_update(act, previous):
    return copy.copy(act)

Now the function behaves as expected; of course it is necessary to adapt the state update and the coupled computation functions if one needs more coupled logic.
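Putting the fix together (a sketch under the same toy setup as above; given_func4 and out4 are just illustrative names):

import copy

given_func4 = generalised_coupled_computation(copy.copy(df.loc[0]),
                                              coupled_computation,
                                              new_update)
out4 = my_data_frame.apply(given_func4, axis=1, meta=float)
print(out4.compute())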
I'm looking to extend a pandas DataFrame, creating an object where all of the original DataFrame attributes/methods are intact, while making a few new attributes/methods available. I also need the ability to convert (or copy) objects that are already DataFrames to my new class. What I have seems to work, but I feel like I might have violated some fundamental convention. Is this the proper way of doing this, or should I even be doing it in the first place?
import pandas as pd

class DataFrame(pd.DataFrame):
    def __init__(self, df):
        df.__class__ = DataFrame  # effectively 'cast' the pandas DataFrame as my own
the idea being I could then initialize it directly from a Pandas DataFrame, e.g.:
df = DataFrame(pd.read_csv(path))
I'd probably do it this way, if I had to:
import pandas as pd

class CustomDataFrame(pd.DataFrame):
    @classmethod
    def convert_dataframe(cls, df):
        df.__class__ = cls
        return df

    def foo(self):
        return "Works"

df = pd.DataFrame([1, 2, 3])
print(df)
# print(df.foo())  # Will throw, since .foo() is not defined on pd.DataFrame

cdf = CustomDataFrame.convert_dataframe(df)
print(cdf)
print(cdf.foo())  # "Works"
Note: This will forever change the df object you pass to convert_dataframe:
print(type(df)) # <class '__main__.CustomDataFrame'>
print(type(cdf)) # <class '__main__.CustomDataFrame'>
If you don't want this, you could copy the dataframe inside the classmethod.
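A sketch of that copying variant (my own wording, under the same CustomDataFrame example):

class CustomDataFrame(pd.DataFrame):
    @classmethod
    def convert_dataframe(cls, df):
        df = df.copy()       # work on a copy so the caller's object is untouched
        df.__class__ = cls
        return df

    def foo(self):
        return "Works"

df2 = pd.DataFrame([1, 2, 3])
cdf2 = CustomDataFrame.convert_dataframe(df2)
print(type(df2))   # still <class 'pandas.core.frame.DataFrame'>
print(type(cdf2))  # <class '__main__.CustomDataFrame'>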
If you just want to add methods to a DataFrame, you can monkey-patch it before you run anything else, as below.
>>> import pandas
>>> def foo(self, x):
... return x
...
>>> foo
<function foo at 0x00000000009FCC80>
>>> pandas.DataFrame.foo = foo
>>> bar = pandas.DataFrame()
>>> bar
Empty DataFrame
Columns: []
Index: []
>>> bar.foo(5)
5
>>>
First things first: the title is rather unclear, but nothing better sprang to my mind. I'll elaborate on the problem in more detail.
I've found myself repeating this routine a lot with pandas DataFrames: I need to work for a while with only part (some columns) of the DataFrame, and later I want to add those columns back. Then an idea came to my mind: context managers. But I am unable to come up with the correct implementation (if there is one).
import pandas as pd
import numpy as np

class ProtectColumns:
    def __init__(self, df, protect_cols=[]):
        self.protect_cols = protect_cols
        # preserve a copy of the part we want to protect
        self.protected_df = df[protect_cols].copy(deep=True)
        # create self.df with only the part we want to work on
        self.df = df[[x for x in df.columns if x not in protect_cols]]

    def __enter__(self):
        # return self, or maybe only self.df?
        return self

    def __exit__(self, *args, **kwargs):
        # btw. do i need *args and **kwargs here?
        # append the preserved data back to the original, now changed
        self.df[self.protect_cols] = self.protected_df
if __name__ == '__main__':
    # testing
    # create a random DataFrame
    df = pd.DataFrame(np.random.randn(6, 4), columns=list("ABCD"))
    # unnecessary step
    df = df.applymap(lambda x: int(100 * x))
    # show it
    print(df)
    # work without cols A and B
    with ProtectColumns(df, ["A", "B"]) as PC:
        # make everything 0
        PC.df = PC.df.applymap(lambda x: 0)
        # this prints the expected output
        print(PC.df)
However, say I don't want to use PC.df from then on, but df. I could just do df = PC.df, or make a copy inside the with block or after it. But is it possible to handle this inside, e.g., the __exit__ method?
# unchanged df
print(df)

with ProtectColumns(df, list("AB")) as PC:
    PC.applymap(somefunction)

# df is now changed
print(df)
Thanks for any ideas!
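One hedged idea for that last point (a sketch of my own, not an accepted answer; ProtectColumnsInPlace is a made-up name): keep a reference to the original frame and write the working columns back into it in __exit__, so the caller's df reflects the changes without any reassignment.

import numpy as np
import pandas as pd

class ProtectColumnsInPlace:
    def __init__(self, df, protect_cols=()):
        self.original = df
        self.protect_cols = list(protect_cols)
        # work on a copy of the unprotected part only
        self.df = df[[c for c in df.columns if c not in self.protect_cols]].copy()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # write the working columns back into the original frame, column by column,
        # so the caller's df reflects the changes without reassignment
        for col in self.df.columns:
            self.original[col] = self.df[col]
        return False

df = pd.DataFrame(np.random.randn(6, 4), columns=list("ABCD"))
with ProtectColumnsInPlace(df, ["A", "B"]) as PC:
    PC.df = PC.df.applymap(lambda x: 0)
print(df)   # C and D are now 0; A and B are untouched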