Use from_dict() to initialize a subclass of pandas DataFrame - python

I know that inheritance is not the simplest approach when using pandas, but I'm curious as to how to obtain the result I wish for.
Say I have a function that returns a dictionary from a string (the string could be a path, the name of a collection...):
def str_to_dict(string):
    ...

dic = str_to_dict(s1)
dic
>>> {'col_1': ['a', 'b', ...], 'col2': [1, 2, ...]}
What I want to do is create a subclass of pandas.DataFrame that would contain the data of dic while being initialized from a string using the method above, and retain the string as an attribute.
I know that simply passing a dictionary into pandas.DataFrame would work for some cases, but I might need to change the orientation (keys being the index instead of the columns names), so I wanted to use the from_dict constructor to get my DataFrame.
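For reference, the orientation switch mentioned here is from_dict's orient parameter; a minimal illustration (my own, not part of the original question):

import pandas as pd

dic = {'col_1': ['a', 'b'], 'col2': [1, 2]}

# Default orient='columns': keys become column names
pd.DataFrame.from_dict(dic)

# orient='index': keys become the index instead
pd.DataFrame.from_dict(dic, orient='index')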
Here is my work on it:
# Works, but only if you want the keys of the dictionary to be the columns
class MySubClass(pandas.DataFrame):
    def __init__(self, string):
        self.my_string_attribute = string
        dic = str_to_dict(string)
        pandas.DataFrame.__init__(self, dic)
# Does not work, throws a RecursionError
# It is because __init__ is used with the from_dict constructor and calls itself
class MySubClass(pandas.DataFrame):
    def __init__(self, string):
        self.my_string_attribute = string
        self.from_dict(str_to_dict(string))  # Here I could add any option needed
Once again, I know there are alternatives to inheritance and I might go with composition to carry on with my project, but I am just curious how it could be made to work.

The reason why what you are trying to do doesn't work is elaborated here:
https://github.com/pandas-dev/pandas/issues/2859
And this won't work because it does not return an instance of your subclass (there are a bunch of issues here):
# Works, but only if you want the keys of the dictionary to be the columns
class MySubClass(pandas.DataFrame):
    def __init__(self, string):
        self.my_string_attribute = string
        dic = str_to_dict(string)
        pandas.DataFrame.__init__(self, dic)
So what you can do is add capabilities to the pd.DataFrame class like this:
import ast
import pandas as pd

def str_to_dict(string):
    return ast.literal_eval(string)

class MySubClass(pd.DataFrame):
    def from_str(self, string):
        df_obj = super().from_dict(str_to_dict(string))
        df_obj.my_string_attribute = string
        return df_obj

data = "{'col_1' : ['a','b'], 'col2': [1, 2]}"
obj = MySubClass().from_str(data)

type(obj)
# __main__.MySubClass

obj.my_string_attribute
# "{'col_1' : ['a','b'], 'col2': [1, 2]}"


How to overwrite the object representation of the index in pandas

I am using Enums as keys in pandas. Below is a small example of a dataframe which will be converted to JSON.
[IN]
# coding=utf-8
# Written in python 3.7
# pandas==0.23.4
from enum import unique, Enum
import pandas as pd

@unique
class DEMO(Enum):
    FIRST = "hello"
    SECOND = "world"

df = pd.DataFrame()
df[DEMO.FIRST] = pd.Series([1, 2])
df[DEMO.SECOND] = pd.Series([1, 2])
print(df.to_json())
[OUT]
{"{"name":"FIRST"}":{"0":1,"1":2},"{"name":"SECOND"}":{"0":1,"1":2}}
What I would like is for the Enum not to be represented as an object (as defined via its __dir__(self) method), but instead as a string containing the value, equivalent to string constants:
[OUT]
{"hello":{"0":1,"1":2},"world":{"0":1,"1":2}}
Is this possible without using DEMO.FIRST.value or DEMO.SECOND.value as indices?
You need the value attribute of the Enums. Then one possibility would be using a lambda with df.rename.
df.rename(lambda x: x.value, axis=1, copy=False).to_json()
# Out '{"hello":{"0":1,"1":2},"world":{"0":1,"1":2}}'
I found another solution which works pretty well even if the enum is more complex or consists of multiple datatypes.
# coding=utf-8
# Written in python 3.7
# pandas==0.23.4
from enum import unique, Enum
import pandas as pd

class Complex:
    name: str
    type: str

    def __init__(self, name: str, type: str):
        self.name = name
        self.type = type

    def __str__(self) -> str:
        return self.name

@unique
class DEMO(str, Enum):
    FIRST = Complex("Hello", "Siebzig")
    SECOND = Complex("World", "Zehn")

df = pd.DataFrame()
df[DEMO.FIRST] = pd.Series([1, 2])
df[DEMO.SECOND] = pd.Series([1, 2])
print(df.to_json())
will produce the output
{"Hello":{"0":1,"1":2},"World":{"0":1,"1":2}}
The important change was that I added str as a base class before Enum.
This is even pretty simple to use with dynamically typed Enum contents, as long as they have a string representation (def __str__(self) -> str:). The Enum class will automatically check the string serialization of all the members for uniqueness, without the need to override __hash__(self).
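As a quick sanity check of what the mixin does (my own illustration, not from the original post):

# With the str mixin, each member is itself a str built via __str__,
# so hashing and JSON serialization use the plain string
assert isinstance(DEMO.FIRST, str)
assert DEMO.FIRST == "Hello"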

Pandas style object members

I'm trying to figure out how Pandas manages to create new object members on the fly. For example, if you do this:
d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d)
You can immediately do this:
df.col1
and get the contents of col1. How does Pandas create the col1 member on the fly?
Thanks.
Relevant code in the repository that checks for a dictionary input:
class DataFrame(NDFrame):
    def __init__(self, data=None, index=None, columns=None, dtype=None,
                 copy=False):
        if data is None:
            data = {}
        # Some other if-statements that check data types...
        elif isinstance(data, dict):
            mgr = self._init_dict(data, index, columns, dtype=dtype)
which uses the _init_dict method:
def _init_dict(self, data, index, columns, dtype=None):
    if columns is not None:
        # Does some stuff - but this isn't your case
        ...
    else:
        keys = list(data.keys())
        if not isinstance(data, OrderedDict):
            # So this part is trying to sort keys to cols in alphabetical order
            # The _try_sort function is simple, exists in pandas.core.common
            keys = _try_sort(keys)
        columns = data_names = Index(keys)
So the real work comes from the Index class in pandas.core.indexes.base. From there things get really complicated (and my understanding of what it means to explain the "how" of anything, without regressing all the way down to machine code, started to melt away), but it's safe to say that if you give the pandas.Index class a 1-dimensional array of data, it will create an object with a sliceable set of values and an associated data type.
Which is exactly what you're observing - you essentially fed it a bunch of keys and pandas understood that it needed to give you something back that you could access as an index (since df.col1 is just syntactic sugar for df['col1']), that you could slice (df[0:1]), and that knew its own data types.
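To make that concrete, a small interactive check (my own illustration):

import pandas as pd

idx = pd.Index(['col1', 'col2'])
idx[0:1]   # sliceable: Index(['col1'], dtype='object')
idx.dtype  # knows its own data type: dtype('O')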
And, of course, after asking the question, I found the answer myself.
It turns out you can use __getattr__ to achieve this. The easiest way (and the one I happen to want) is to use a dictionary, then have __getattr__ return values from the dictionary, like so:
class test():
    def __init__(self):
        # super().__init__()
        self.adict = {'spam': 'eggs'}

    def __getattr__(self, attr):
        return self.adict[attr]

nt = test()
print(nt.spam)
__getattr__ is called when a class attribute isn't found, as is the case here. The interpreter can't find the spam attribute, so it defers to __getattr__. Things to keep in mind:
If the key doesn't exist in the dictionary, this will raise a KeyError, not an AttributeError.
Don't use __getattribute__, because it is called on every attribute access, which will break your entire class.
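If you want missing keys to behave like ordinary missing attributes, one small tweak (my own addition, not from the original answer) is to translate the KeyError:

def __getattr__(self, attr):
    try:
        return self.adict[attr]
    except KeyError:
        # Re-raise as AttributeError so hasattr() and friends behave as expected
        raise AttributeError(attr)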
Thanks to everyone for the input on this.

Python class that automatically appends values to existing attribute OR creates and fills new attribute

My goal here is to be able to create nested dictionaries that have attributes that hold lists of values. For example, I want to be able to do something like this:
mydict['Person 1']['height'].vals = [23, 25, 32]
mydict['Person 2']['weight'].vals = [100, 105, 110]
mydict['Person 2']['weight'].excel_locs = ['A1', 'A2', 'A3']
So, for each "person" I can keep track of multiple things I might have data on, such as height and weight. The attribute I'm calling 'vals' is just a list of values for heights or weights. Importantly, I want to be able to keep track of things like where the raw data came from, such as its location in an Excel spreadsheet.
Here's what I am currently working off of:
import collections

class Vals:
    def add(self, list_of_vals=[], attr_name=[]):
        setattr(self, attr_name, list_of_vals)

    def __str__(self):
        return str(self.__dict__)

mydict = collections.defaultdict(Vals)
So, I want to be able to add new keys as needed, such as mydict['Person 10']['test scores'], and then create a new attribute such as "vals" if it doesn't exist, but also append new values to it if it does.
Example of what I want to achieve:
mydict['Person 10']['test scores'].add([10, 20, 30], 'vals')
Which should allow mydict['Person 10']['test scores'].vals to return [10, 20, 30].
But then I also want to be able to append to this list later on if needed, such that using .add again appends to the existing list. For example, mydict['Person 10']['test scores'].add([1, 2, 3], 'vals') should then allow me to return [10, 20, 30, 1, 2, 3] from mydict['Person 10']['test scores'].vals.
I'm still very much getting used to object oriented programming, classes, etc. I am very open to better strategies that might exist for achieving my goal, which is just a nested dictionary structure which I find convenient for holding data.
If we just modify the Vals class above, it needs a way to determine whether an attribute exists. If it doesn't yet exist, create it and populate it with list_of_vals; otherwise, append to the existing list (a sketch of such a check appears after the answers below).
Thanks!
From what I understand, you want something that can conveniently hold data. I would actually build a class instead of a nested dictionary, because this makes it easier to see how everything works together (and it also helps organize everything!).
class Person(object):
    """__init__() functions as the class constructor"""
    def __init__(self, name=None, testScores=None):
        self.name = name
        self.testScores = testScores

# make a list of class Person(s)
personList = []
personList.append(Person("Person 1", [10, 25, 32]))
personList.append(Person("Person 2", [22, 37, 45]))

print("Show one particular item:")
print(personList[0].testScores)
personList[0].testScores.append(50)
print(personList[0].testScores)
print(personList[1].name)
Basically, the Person class is what holds all of the data for an instance of it. If you want to add different types of data, you would add a parameter to the __init__() function like this:
def __init__(self, name=None, testScores=None, weight=None):
    self.name = name
    self.testScores = testScores
    self.weight = weight
You can edit the values just like you would a variable.
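For example (my own illustration):

# Reassign or extend attributes directly on an instance
personList[0].weight = 150
personList[0].testScores.append(60)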
If this isn't what you are looking for, or you are confused, I am willing to try to help you more.
I agree that using a Person class is a better solution here. It's a more abstract and intuitive way to represent the concept, which will make your code easier to work with.
Check this out:
class Person():
    # Define a custom method for retrieving attributes
    def __getattr__(self, attr_name):
        # If the attribute exists, setdefault will return it.
        # If it doesn't yet exist, it will set it to an empty
        # dictionary, and then return it.
        return self.__dict__.setdefault(attr_name, {})

carolyn = Person()
carolyn.name["value"] = "Carolyn"
carolyn.name["excel_loc"] = "A1"

print(carolyn.name)
# {"value": "Carolyn", "excel_loc": "A1"}

maria = Person()
print(maria.name)
# {}
Then collecting people into a dictionary is easy:
people = {
    "carolyn": carolyn,
    "maria": maria,
}
people["Ralph"] = Person()
people["Ralph"].name["value"] = "Ralph"
You've also made a tricky mistake in defining the add method:
def add(self, list_of_vals=[], attr_name=[]):
    ...
In Python, you never want to use an empty list as a default argument. Because of how default values are stored, the default will reference the same list on every call, instead of creating a new, empty list each time.
Here's a common workaround:
def add(self, list_of_vals=None, attr_name=None):
    list_of_vals = list_of_vals or []
    attr_name = attr_name or []
    ...
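Putting the pieces together, here is a minimal sketch of the append-or-create behaviour the question asks for (my own sketch, using hasattr/getattr, with a nested defaultdict so the two-level lookup works):

import collections

class Vals:
    def add(self, list_of_vals=None, attr_name='vals'):
        list_of_vals = list_of_vals or []
        if hasattr(self, attr_name):
            # Attribute already exists: extend the stored list
            getattr(self, attr_name).extend(list_of_vals)
        else:
            # First use: create the attribute with a fresh copy
            setattr(self, attr_name, list(list_of_vals))

    def __str__(self):
        return str(self.__dict__)

mydict = collections.defaultdict(lambda: collections.defaultdict(Vals))
mydict['Person 10']['test scores'].add([10, 20, 30], 'vals')
mydict['Person 10']['test scores'].add([1, 2, 3], 'vals')
print(mydict['Person 10']['test scores'].vals)  # [10, 20, 30, 1, 2, 3]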

Python dictionary set all values to class object

After having created a dictionary from one dataframe column as keys, I want to set all values to an instance of an object (the class serves as container for storing key statistics for each row of the original pandas dataframe).
Hence, I tried this:
class Bond:
    def __init__(self):
        self.totalsize = 0
        self.count = 0

if __name__ == '__main__':
    isin_dict = list_of_isins.set_index('isin').T.to_dict()
    isin_dict = dict.fromkeys(isin_dict, Bond())
The problem is that all values in isin_dict point to the same address, ie all rows share the same Bond class object.
How could I create a dictionary with each key holding a separate class instance as value?
The reason for this is already explained here:
dict.fromkeys() uses the same value for every key.
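A quick demonstration of the aliasing (my own illustration):

class Bond:
    def __init__(self):
        self.count = 0

d = dict.fromkeys(['a', 'b'], Bond())  # Bond() is evaluated exactly once
d['a'].count = 99
print(d['b'].count)  # 99, because both keys hold the same instance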
The solution is to use a dictionary comprehension (shown after the defaultdict example below) or to use defaultdict from the collections module.
Sample Code to use defaultdict
from collections import defaultdict

class Bond:
    def __init__(self):
        pass

# I have just used your variable and stored the keys in a list
keys = list(list_of_isins.set_index('isin').T)
d = defaultdict(Bond)
for key in keys:
    d[key] = Bond()
print(d)
The first argument to defaultdict must be callable (a default factory), otherwise you may get a TypeError; passing the class itself works. Alternately you may also pass a lambda expression, which is likewise callable.
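And the dictionary-comprehension alternative mentioned above, reusing the question's variables:

# __init__ runs once per key, so every value is a distinct Bond instance
isin_dict = {isin: Bond() for isin in list_of_isins.set_index('isin').T.to_dict()}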

Python - Proper way of serially reassign/update class members

I have a class whose members are lists of numbers built by accumulating values from experimental data, like
class MyClass:
    def __init__(self):
        self.container1 = []
        self.container2 = []
        ...

    def accumulate_from_dataset(self, dataset):
        for entry in dataset:
            self.container1.append(foo(entry))
            self.container2.append(bar(entry))
            ...

    def process_accumulated_data(self):
        '''called when all the data is gathered'''
        process1(self.container1)
        process2(self.container2)
        ...
Issue: it would be beneficial if I could convert all the lists into numpy arrays.
what I tried: the simple conversion
self.container1 = np.array(self.container1)
works. However, if I would like to consider "more fields in one shot", like
lists_to_convert = [self.container1, self.container2, ...]

def converter(lists_to_convert):
    for list in lists_to_convert:
        list = np.array(list)
nothing actually changes, because the loop variable only rebinds a local reference; the class members themselves are left untouched.
I am thus wondering if there is a smart approach/workaround to handle the whole conversion process.
Any help appreciated
From The Pragmatic Programmer:
Ask yourself: "Does it have to be done this way? Does it have to be done at all?"
Maybe you should rethink your data structure? Maybe some dictionary or a simple list of lists would be easier to handle?
Note that in the example presented, container1 and container2 are just transformations of the initial dataset. It looks like a good place for a list comprehension:
foo_data = [foo(d) for d in dataset]
# or even
foo_data = map(foo, dataset)
# or generator version
foo_data_iter = (foo(d) for d in dataset)
If you really want to operate on the instance variables as in the example, have a look at the getattr, setattr and hasattr built-in functions.
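For instance, a minimal sketch of that approach (the function name and usage are my own, hypothetical):

import numpy as np

def convert_to_arrays(obj, attr_names):
    # Rebind each named attribute to a numpy array built from its current value
    for name in attr_names:
        if hasattr(obj, name):
            setattr(obj, name, np.array(getattr(obj, name)))

# convert_to_arrays(my_instance, ["container1", "container2"])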
There isn't an easy way to do this because, as you say, Python passes references by value.
You could add a to_numpy method to your class:
import numpy as np

class MyClass:
    def __init__(self):
        self.container1 = []
        self.container2 = []
        ...

    def to_numpy(self, container):
        lst = getattr(self, container)
        setattr(self, container, np.array(lst))
    ...
And then do something like:
obj = MyClass()
lists_to_convert = ["container1", "container2", ...]

def converter(lists_to_convert):
    for name in lists_to_convert:
        obj.to_numpy(name)
But it's not very pretty, and this sort of code would normally make me take a step back and think about my design.
