How to name dataframe columns based on a class's attributes? - python

I've spent a huge amount of time on this. I have a tuple of non-callable class instances, all named SymbolInfo, with the same attribute labels. Let's say:
In: print(my_tuple)
Out: (SymbolInfo(att_1=False, att_2=0, att_3=1.0),SymbolInfo(att_1=True, att_2=0, att_3=1.5))
My objective is to create a dataframe from this tuple. When I convert it to a list, it works fine:
df = pd.DataFrame(list(my_tuple))
I get the dataframe, but I don't get the column labels, which should be the names of the class attributes (i.e. att_1, att_2, att_3).
The attribute names and their number (not their values) are the same for all instances, so I could use any instance to get them.
I've tried methods like inspect.getmembers(my_tuple[0]) and inspect.getfullargspec(my_tuple[0]).args without success. It's important to get those attributes in the same order in which they appear.

Got this solution:
my_dict = my_tuple[0]._asdict()
my_col_list = list(my_dict.keys())
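The solution above can be sketched end to end as follows. This is a minimal sketch that assumes SymbolInfo behaves like a namedtuple (which is what having _asdict() suggests); the namedtuple definition here is a hypothetical stand-in for the real class, not the original one.

```python
from collections import namedtuple

import pandas as pd

# Hypothetical stand-in for the real SymbolInfo type, assumed to be namedtuple-like.
SymbolInfo = namedtuple("SymbolInfo", ["att_1", "att_2", "att_3"])

my_tuple = (
    SymbolInfo(att_1=False, att_2=0, att_3=1.0),
    SymbolInfo(att_1=True, att_2=0, att_3=1.5),
)

# _asdict() returns the fields in definition order, so its keys work as column labels.
my_col_list = list(my_tuple[0]._asdict().keys())

# Pass the labels explicitly instead of relying on pandas to infer them.
df = pd.DataFrame(list(my_tuple), columns=my_col_list)
```

Passing columns explicitly keeps the labels in the same order as the attributes, which is the requirement stated in the question.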

You can access the attributes (in the order they were added) through the instance's __dict__ attribute, like so:
class thing():
    def __init__(self, att_1, att_2, att_3):
        self.z_att_1 = att_1
        self.att_2 = att_2
        self.att_3 = att_3
a = thing('bob', 'eve', 'alice')
b = thing('john', 'jack', 'dan')
c = (a,b)
# see the attributes
print(c[0].__dict__)
This results in:
# note that z_att_1 is first
{'z_att_1': 'bob', 'att_2': 'eve', 'att_3': 'alice'}
Now you can loop through the dictionary and pull out the keys for the attribute names.

Just create a dict from each instance and then create the dataframe:
pd.DataFrame(list(map(lambda x: x.__dict__, my_tuple)))
I also recommend using the attrs library for OOP in Python.

Related

How do I get all the attributes from a class set by an imported module, then store them into Pandas columns?

Let's say I have two scripts.
#hhh.py module
class HHH:
    def __init__(self, arg1, arg2):
        self.arg1 = arg1
        self.arg2 = arg2
        self.some_variable1 = 123
        self.some_variable2 = 'abc'
And a script that imports the above module.
The ultimate goal is to create a Pandas dataframe that stores the variable names in one column and their values in another column (for exporting later):
import hhh
import pandas as pd
# Init some Pandas dataframe
df = pd.DataFrame(columns = ['VARIABLE', 'VALUE'])
# Set one column for HHH attribute names, other for attribute values
# So it would look something like this:
# In a dataframe:
# VARIABLE = ['arg1', 'arg2', 'some_variable1', 'some_variable2']
# VALUE = [arg1, arg2, 123, 'abc']
Essentially, I have varying variables and so many of them, so I can't hard code them. I've been trying to wrap my head around this, but I don't know what it's called -- no luck on search.
Would this be possible?
I will just give you some insight into how to access your data; you can then code the rest yourself.
Every class instance has a __dict__ attribute, which stores the names of the instance's attributes and their values.
a = hhh.HHH('a', 'b')
Here, we are creating an instance for the class, with arg1 as 'a' and arg2 as 'b'.
list(a.__dict__.keys())
Output:
['arg1', 'arg2', 'some_variable1', 'some_variable2']
list(a.__dict__.values())
Output:
['a', 'b', 123, 'abc']
Each key of __dict__ is an attribute name, and the corresponding value is that attribute's value.
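Putting this together, the two-column dataframe from the question could be built roughly as follows. This is a minimal sketch: the HHH class is repeated inline (rather than imported from hhh.py) only so that the example is self-contained, and vars(a) is used as an equivalent way of reading a.__dict__.

```python
import pandas as pd

# Copy of the HHH class from hhh.py, repeated here so the sketch runs on its own.
class HHH:
    def __init__(self, arg1, arg2):
        self.arg1 = arg1
        self.arg2 = arg2
        self.some_variable1 = 123
        self.some_variable2 = 'abc'

a = HHH('a', 'b')

# vars(a) returns a.__dict__; its items are (name, value) pairs in assignment order.
df = pd.DataFrame(list(vars(a).items()), columns=['VARIABLE', 'VALUE'])
```

Nothing is hard-coded here, so attributes added to the class later would show up as extra rows.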

Store data in a class dynamically and access data as class attributes

I am trying to write a class that takes a dict whose keys are dataframe IDs (strings) and whose values are DataFrames, and that creates class attributes for accessing the data.
I was able to write a small example of a similar class, but it needs each accessor to be written by hand as a method. Instead, I would like to loop over the data, take the keys of the dfs, and allow access to each df through an attribute.
minimum working example
from dataclasses import dataclass
import pandas as pd
# re-writing as dataclass
@dataclass
class Dataset:
    # data container dictionary as class attribute
    dict = {'df1_id':pd.DataFrame({'col1':[1,1]}),
            'df2_id':pd.DataFrame({'col2':[2,2]}),
            'df3_id':pd.DataFrame({'col3':[3,3]})}
    def df1_id(self) -> pd.DataFrame:# method I would like to turn into an attribute
        return self.dict['df1_id']
    def df2_id(self) -> pd.DataFrame:# same as the method above
        return self.dict['df2_id']
    def df3_id(self) -> pd.DataFrame:# same as the method above
        return self.dict['df3_id']
    def dataframes_as_class_attributes(self):
        # store the dfs to access as class attributes
        # replacing 3 methods above
        return
result
datasets = Dataset()
print(datasets.df1_id())
expected result
datasets = Dataset()
print(datasets.df1_id) # class attribute created by looping through the dict object
Edit:
Similar to this: How to read the contents of a csv file into a class with each csv row as a class instance
You could use setattr like below:
from dataclasses import dataclass
import pandas as pd
@dataclass
class Dataset:
    dict_ = {'df1_id':pd.DataFrame({'col1':[1,1]}),
             'df2_id':pd.DataFrame({'col2':[2,2]}),
             'df3_id':pd.DataFrame({'col3':[3,3]})}
    def __post_init__(self):
        for key, val in self.dict_.items():
            setattr(self, key, val)
To avoid shadowing Python builtins such as dict (or clashing with keywords), append a single trailing underscore to the variable name (PEP 8).
"taking in the keys for the dfs and allow for access to each df using attributes."
It seems that the only purpose of the class is to have attribute access syntax. In that case, it would be simpler to just create a namespace object.
from types import SimpleNamespace
import pandas as pd
class Dataset(SimpleNamespace):
    pass
    # extend it possibly
data = {
    'df1_id':pd.DataFrame({'col1':[1,1]}),
    'df2_id':pd.DataFrame({'col2':[2,2]}),
    'df3_id':pd.DataFrame({'col3':[3,3]})
}
datasets = Dataset(**data)
Output:
>>> datasets.df1_id
   col1
0     1
1     1
>>> datasets.df2_id
   col2
0     2
1     2
>>> datasets.df3_id
   col3
0     3
1     3

Why are multiple values incorrectly updated in my dynamically created nested dicts?

dfs is a dict of dataframes whose keys are named like this: 'datav1_135_gl_b17'
We would like to calculate a matrix with constants. It should be possible to assign the values in the matrix according to the attributes encoded in the df name, in this example '135' and 'b17'.
If you want code to create an example dfs, let me know, I've cut it out to more clearly state the problem.
We create a nested dict dynamically with the following function:
def ex_calc_time(dfs):
    formats = []
    grammaturs = []
    for i in dfs:
        # (...)
        # format
        split1 = i.split('_')
        format = split1[-1]
        format.replace(" ", "")
        formats.append(format)
        formats = list(set(formats))
        # grammatur
        # split1 = i.split('_')
        grammatur = split1[-3]
        grammatur.replace(" ", "")
        grammaturs.append(grammatur)
        grammaturs = list(set(grammaturs))
    # END FLOOP
    dict_mean_time = dict.fromkeys(formats, dict.fromkeys(grammaturs, ''))
    return dfs, dict_mean_time
Then we try to fill the nested dict and change the values like this (which should work according to similar nested dict questions, but it doesn't). 'nope' is updated for both keys:
ex_dict_mean_time['b17']['170'] = 'nope'
ex_dict_mean_time
{'a18': {'135': '', '170': 'nope', '250': ''},
'b17': {'135': '', '170': 'nope', '250': ''}}
I also tried creating a dataframe from ex_dict_mean_time and filling it with .loc, but that didn't work either (df remains empty). Moreover I tried this method, but I always end up with the same problem and the values are overwritten. I appreciate any help. If you have any improvements for my code please let me know, I welcome any opportunity to improve.
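The behaviour described above is consistent with how dict.fromkeys works: its second argument is evaluated once, so every outer key ends up referring to the same inner dict object. The sketch below reproduces the effect with the keys from the example output and shows one possible alternative, a dict comprehension that creates a separate inner dict per key. The variable names shared and independent are illustrative only and are not taken from the original code.

```python
formats = ['a18', 'b17']
grammaturs = ['135', '170', '250']

# dict.fromkeys evaluates its value argument once, so every outer key
# points at the very same inner dict object.
shared = dict.fromkeys(formats, dict.fromkeys(grammaturs, ''))
shared['b17']['170'] = 'nope'
assert shared['a18'] is shared['b17']  # one inner dict, seen under both keys

# A dict comprehension builds a fresh inner dict for each outer key,
# so an update under one key does not leak into the others.
independent = {f: dict.fromkeys(grammaturs, '') for f in formats}
independent['b17']['170'] = 'nope'
```

The inner dict.fromkeys(grammaturs, '') call is harmless here because its shared value is an immutable string.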

for loop with same dataframe on both side of the operator

I have defined 10 different DataFrames (A06_df, A07_df, etc.), each of which picks up six different data point inputs in a daily time series over a number of years. To be able to work with them I need to do some formatting operations such as
A07_df=A07_df.fillna(0)
A07_df[A07_df < 0] = 0
A07_df.columns = col # col is defined
A07_df['oil']=A07_df['oil']*24
A07_df['water']=A07_df['water']*24
A07_df['gas']=A07_df['gas']*24
A07_df['water_inj']=0
A07_df['gas_inj']=0
A07_df=A07_df[['oil', 'water', 'gas','gaslift', 'water_inj', 'gas_inj', 'bhp', 'whp']]
etc for a few more formatting operations
Is there a nice way to have a for loop or something so I don’t have to write each operation for each dataframe A06_df, A07_df, A08.... etc?
As an example, I have tried
list=[A06_df, A07_df, A08_df, A10_df, A11_df, A12_df, A13_df, A15_df, A18_df, A19_df]
for i in list:
    i=i.fillna(0)
But this does not do the trick.
Any help is appreciated
As i.fillna() returns a new object (an updated copy of your original dataframe), i=i.fillna(0) rebinds the loop variable i but does not change the dataframes A06_df, A07_df, ... stored in the list.
I suggest you copy the updated content in a new list like this:
list_raw = [A06_df, A07_df, A08_df, A10_df, A11_df, A12_df, A13_df, A15_df, A18_df, A19_df]
list_updated = []
for i in list_raw:
    i=i.fillna(0)
    # More code here
    list_updated.append(i)
To simplify your future processing, I would recommend using a dictionary of dataframes instead of a list of named variables.
dfs = {}
dfs['A0'] = ...
dfs['A1'] = ...
dfs_updated = {}
for k,i in dfs.items():
    i=i.fillna(0)
    # More code here
    dfs_updated[k] = i
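One way to combine the dictionary idea with the formatting steps from the question is to wrap those steps in a function and apply it to every entry. The following is only a sketch: the function name format_well, the contents of col, and the tiny sample frames are assumptions made for illustration (the question only says that col is defined and that there are six inputs).

```python
import pandas as pd

# Assumed column labels for the six raw inputs; the question does not list them.
col = ['oil', 'water', 'gas', 'gaslift', 'bhp', 'whp']

def format_well(df, col):
    """Apply the formatting steps from the question to one dataframe and return the result."""
    df = df.fillna(0)          # returns a new frame, the input is left untouched
    df[df < 0] = 0             # clip negative readings to zero
    df.columns = col
    for name in ['oil', 'water', 'gas']:
        df[name] = df[name] * 24
    df['water_inj'] = 0
    df['gas_inj'] = 0
    return df[['oil', 'water', 'gas', 'gaslift', 'water_inj', 'gas_inj', 'bhp', 'whp']]

# Hypothetical one-row sample data standing in for the real daily time series.
raw = {
    'A06': pd.DataFrame([[1.0, 2.0, float('nan'), 4.0, 5.0, 6.0]]),
    'A07': pd.DataFrame([[-1.0, 2.0, 3.0, 4.0, 5.0, 6.0]]),
}

# Build a new dict of formatted frames, keyed by the same names.
dfs = {name: format_well(df, col) for name, df in raw.items()}
```

Because the function returns the formatted frame and the comprehension stores it, the rebinding problem with the loop variable does not arise.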

Propagate pandas series metadata through joins

I'd like to be able to attach metadata to the series of dataframes (specifically, the original filename), so that after joining two dataframes I can see metadata on where each of the series came from.
I see github issues regarding _metadata (here, here), including some relating to the current _metadata attribute (here), but nothing in the pandas docs.
So far I can modify the _metadata attribute to supposedly allow preservation of metadata, but get an AttributeError after the join.
import numpy as np
import pandas as pd
df1 = pd.DataFrame(np.random.randint(0, 4, (6, 3)))
df2 = pd.DataFrame(np.random.randint(0, 4, (6, 3)))
df1._metadata.append('filename')
df1[df1.columns[0]]._metadata.append('filename')
for c in df1:
    df1[c].filename = 'fname1.csv'
    df2[c].filename = 'fname2.csv'
df1[0]._metadata # ['name', 'filename']
df1[0].filename # fname1.csv
df2[0].filename # fname2.csv
df1[0][:3].filename # fname1.csv
mgd = pd.merge(df1, df2, on=[0])
mgd['1_x']._metadata # ['name', 'filename']
mgd['1_x'].filename # raises AttributeError
Any way to preserve this?
Update: Epilogue
As discussed here, __finalize__ cannot keep track of Series that are members of a dataframe, only independent series. So for now I'll keep track of the Series-level metadata by maintaining a dictionary of metadata attached to the dataframes. My code looks like:
def cust_merge(d1, d2):
    "Custom merge function for 2 dicts"
    ...
def finalize_df(self, other, method=None, **kwargs):
    for name in self._metadata:
        if method == 'merge':
            lmeta = getattr(other.left, name, {})
            rmeta = getattr(other.right, name, {})
            newmeta = cust_merge(lmeta, rmeta)
            object.__setattr__(self, name, newmeta)
        else:
            object.__setattr__(self, name, getattr(other, name, None))
    return self
df1.filenames = {c: 'fname1.csv' for c in df1}
df2.filenames = {c: 'fname2.csv' for c in df2}
pd.DataFrame._metadata = ['filenames']
pd.DataFrame.__finalize__ = finalize_df
I think something like this will work (and if not, please file a bug report, as this, while supported, is a bit bleeding edge; in other words, it IS possible that the join methods don't call this all the time. That is a bit untested).
See this issue for a more detailed example/bug fix.
from pandas import DataFrame
DataFrame._metadata = ['name','filename']
def __finalize__(self, other, method=None, **kwargs):
    """
    propagate metadata from other to self
    Parameters
    ----------
    other : the object from which to get the attributes that we are going
        to propagate
    method : optional, a passed method name ; possibly to take different
        types of propagation actions based on this
    """
    ### you need to arbitrate when there are conflicts
    for name in self._metadata:
        object.__setattr__(self, name, getattr(other, name, None))
    return self
DataFrame.__finalize__ = __finalize__
So this replaces the default finalizer for DataFrame with your custom one. Where I have indicated, you need to put some code which can arbitrate between conflicts. This is the reason this is not done by default, e.g. frame1 has name 'foo' and frame2 has name 'bar': what do you do when the method is __add__, and what about another method? Let us know what you do and how it works out.
This is ONLY replacing the finalizer for DataFrame (and you can simply do the default action if you want), which is to propagate other to self; you can also choose not to set anything except under special cases of method.
This method is meant to be overridden in sub-classes; that's why you are monkey patching here (rather than sub-classing, which is overkill most of the time).
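As an alternative to patching __finalize__, the "dictionary of metadata attached to the dataframe" idea from the epilogue can also be sketched with the public DataFrame.attrs dictionary, which pandas documents as experimental. This is a different technique from the one in the answer above, not a reconstruction of it. The helper merge_with_filenames, the 'filenames' key, and the sample frames are hypothetical; the sketch rebuilds the mapping by hand after the merge and does not rely on pandas propagating attrs, and it only handles a single string key column.

```python
import pandas as pd

def merge_with_filenames(left, right, on, suffixes=('_x', '_y')):
    """Merge two frames on one key column and rebuild a per-column filename dict."""
    merged = pd.merge(left, right, on=on, suffixes=suffixes)
    filenames = {}
    for frame, other, suffix in ((left, right, suffixes[0]), (right, left, suffixes[1])):
        source = frame.attrs.get('filenames', {})
        for col in frame.columns:
            if col == on:
                # The key column comes from both inputs, so record both sources.
                filenames.setdefault(col, []).append(source.get(col))
            else:
                # pandas only appends a suffix when the name exists in both frames.
                new_name = col + suffix if col in other.columns else col
                filenames[new_name] = source.get(col)
    # Set the mapping explicitly rather than depending on attrs propagation.
    merged.attrs['filenames'] = filenames
    return merged

df1 = pd.DataFrame({'key': [1, 2, 3], 'a': [10, 20, 30]})
df2 = pd.DataFrame({'key': [1, 2, 3], 'a': [40, 50, 60], 'b': [7, 8, 9]})
df1.attrs['filenames'] = {c: 'fname1.csv' for c in df1.columns}
df2.attrs['filenames'] = {c: 'fname2.csv' for c in df2.columns}

mgd = merge_with_filenames(df1, df2, on='key')
```

The conflict question raised above still applies: here the key column simply keeps a list of both filenames, which is only one of several reasonable choices.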
