I'm trying to figure out how Pandas manages to create new object members on the fly. For example, if you do this:
d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d)
You can immediately do this:
df.col1
and get the contents of col1. How does Pandas create the col1 member on the fly?
Thanks.
Relevant code in the repository that checks for a dictionary input:
class DataFrame(NDFrame):
    def __init__(self, data=None, index=None, columns=None, dtype=None,
                 copy=False):
        if data is None:
            data = {}
        # Some other if-statements that check data types...
        elif isinstance(data, dict):
            mgr = self._init_dict(data, index, columns, dtype=dtype)
Which uses the _init_dict method:
def _init_dict(self, data, index, columns, dtype=None):
    if columns is not None:
        pass  # Does some stuff - but this isn't your case
    else:
        keys = list(data.keys())
        if not isinstance(data, OrderedDict):
            # So this part is trying to sort keys to cols in alphabetical order
            # The _try_sort function is simple, exists in pandas.core.common
            keys = _try_sort(keys)
        columns = data_names = Index(keys)
So the real work comes from the Index class in pandas.core.indexes.base. From there things get really complicated (and my sense of what it means to explain the "how" of anything, without regressing all the way down to machine code, started to melt away), but it's safe to say that if you give the pandas Index class a 1-dimensional array of data, it will create an object with a sliceable set and an associated data type.
Which is exactly what you're observing - you essentially fed it a bunch of keys and pandas understood that it needed to give you something back that you could access as an index (since df.col1 is just syntactic sugar for df['col1']), that you could slice (df[0:1]), and that knew its own data types.
And, of course, after asking the question, I found the answer myself.
It turns out you can use __getattr__ to achieve this. The easiest way (and the one I happen to want) is to use a dictionary, then have __getattr__ return values from the dictionary, like so:
class test():
    def __init__(self):
        # super().__init__()
        self.adict = {'spam': 'eggs'}

    def __getattr__(self, attr):
        return self.adict[attr]

nt = test()
print(nt.spam)
__getattr__ is called when a class attribute isn't found through the normal lookup, as is the case here: the interpreter can't find a spam attribute, so it defers to __getattr__. Things to keep in mind:
If the key doesn't exist in the dictionary, this will raise a KeyError, not the AttributeError that callers usually expect.
Don't use __getattribute__: it is called on every attribute access, so overriding it carelessly will mess up your entire class.
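To address that first caveat, one option is to translate the KeyError into an AttributeError inside __getattr__, which also makes hasattr() behave as expected. A minimal sketch (the class name here is illustrative):

```python
class Test:
    """Illustrative stand-in for the 'test' class above."""
    def __init__(self):
        self.adict = {'spam': 'eggs'}

    def __getattr__(self, attr):
        # Translate the missing key into the AttributeError
        # that attribute-access callers expect.
        try:
            return self.adict[attr]
        except KeyError:
            raise AttributeError(attr) from None

nt = Test()
print(nt.spam)             # eggs
print(hasattr(nt, 'ham'))  # False (would blow up with a KeyError otherwise)
```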
Thanks to everyone for the input on this.
Related
What is the difference between the two class definitions below?
class my_dict1(dict):
    def __init__(self, data):
        self = data.copy()
        self.N = sum(self.values())
The above code results in AttributeError: 'dict' object has no attribute 'N', while the code below runs fine:
class my_dict2(dict):
    def __init__(self, data):
        for k, v in data.items():
            self[k] = v
        self.N = sum(self.values())
For example,
d = {'a': 3, 'b': 5}
a = my_dict1(d) # results in attribute error
b = my_dict2(d) # works fine
By assigning to self itself, you rebind the name to a completely different instance from the one you were originally dealing with, so it is no longer the "self". That instance has the broader type dict (because data is a dict), not the narrower type my_dict1. You would need to write self["N"] in the first example for it to run without error, but note that even then, in something like:
abc = my_dict1({})
abc will still not have the key "N", because a completely different instance inside __init__ was given a value for that key. This shows there is no reasonable scenario in which you want to assign self itself to something else.
As for my_dict2: prefer composition over inheritance if you want to use a particular dict as a representation of your domain. This means holding the data as an instance field. See the related C# question Why not inherit from List?; the core answer is still the same. It comes down to whether you want to extend the dict mechanism vs. having a business object based on one.
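To make the composition alternative concrete, here is a minimal sketch (the class name and defensive copy are my own illustrative choices, not from the question):

```python
class ScoredData:
    """Composition: the dict is a field, not a base class."""
    def __init__(self, data):
        self.data = dict(data)              # defensive copy of the input
        self.N = sum(self.data.values())

d = {'a': 3, 'b': 5}
sd = ScoredData(d)
print(sd.N)           # 8
print(sd.data['b'])   # 5
```

Because ScoredData is a plain object, attribute assignment just works, and the dict machinery stays an implementation detail.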
I have created a class with 100+ instance variables (it will be used in a function to do something else). Is there a way to collect all the instance variables into a list, without manually appending each one?
For instance:
class CreateHouse(object):
    def __init__(self):
        self.name = "Foobar"
        self.title = "FooBarTest"
        self.value = "FooBarValue"
        # ...
        # (100 more instance variables)
Is there a quicker way to get all these items into a list than:
theList = []
theList.append(self.name)
theList.append(self.title)
theList.append(self.value)
# ... (x100 elements)
The list would be used to perform another task, in another class/method.
The only solution (without totally rethinking your whole design, which FWIW might be an option to consider, cf. my comments on your question) is to keep a list of the attribute names (in the order you want them in the final list) and use getattr:
class MonstruousGodClass(object):
    _fields_list = ["name", "title", "value", ]  # etc...

    def as_list(self):
        return [getattr(self, fieldname) for fieldname in self._fields_list]
Now since, as I mentioned in a comment, a list is NOT the right datatype here (from a semantic point of view, at least), you may want to use a dict instead, which makes the code much simpler:
import copy

def as_dict(self):
    # we return a deepcopy to avoid unexpected side-effects
    return copy.deepcopy(self.__dict__)
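Putting the two snippets together, a runnable sketch might look like this (the field names and values are illustrative, standing in for the asker's 100+ attributes):

```python
import copy

class GodClass:
    # The attribute order we want in the final list
    _fields_list = ["name", "title", "value"]

    def __init__(self):
        self.name = "Foobar"
        self.title = "FooBarTest"
        self.value = "FooBarValue"

    def as_list(self):
        return [getattr(self, f) for f in self._fields_list]

    def as_dict(self):
        # deepcopy so callers can't mutate our state through the result
        return copy.deepcopy(self.__dict__)

obj = GodClass()
print(obj.as_list())           # ['Foobar', 'FooBarTest', 'FooBarValue']
print(obj.as_dict()['title'])  # FooBarTest
```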
My goal here is to be able to create nested dictionaries that have attributes that hold lists of values. For example, I want to be able to do something like this:
mydict['Person 1']['height'].vals = [23, 25, 32]
mydict['Person 2']['weight'].vals = [100, 105, 110]
mydict['Person 2']['weight'].excel_locs = ['A1', 'A2', 'A3']
So, for each "person" I can keep track of multiple things I might have data on, such as height and weight. The attribute I'm calling 'vals' is just a list of values for heights or weights. Importantly, I want to be able to keep track of things like where the raw data came from, such as its location in an Excel spreadsheet.
Here's what I am currently working off of:
import collections

class Vals:
    def add(self, list_of_vals=[], attr_name=[]):
        setattr(self, attr_name, list_of_vals)

    def __str__(self):
        return str(self.__dict__)

mydict = collections.defaultdict(Vals)
So, I want to be able to add new keys as needed, such as mydict['Person 10']['test scores'], and then create a new attribute such as "vals" if it doesn't exist, but also append new values to it if it does.
Example of what I want to achieve:
mydict['Person 10']['test scores'].add([10, 20, 30], 'vals')
Which should allow mydict['Person 10']['test scores'].vals to return [10, 20, 30]
But then I also want to be able to append to this list later on if needed, such that using .add again appends to the existing list. For example, mydict['Person 10']['test scores'].add([1, 2, 3], 'vals') should then make mydict['Person 10']['test scores'].vals return [10, 20, 30, 1, 2, 3].
I'm still very much getting used to object oriented programming, classes, etc. I am very open to better strategies that might exist for achieving my goal, which is just a nested dictionary structure which I find convenient for holding data.
If we just modify the Vals class above, it needs a way to determine whether an attribute already exists: if not, create it and populate it with list_of_vals; otherwise, append to the existing list.
Thanks!
From what I understand, you want something that can conveniently hold data. I would actually build a class instead of a nested dictionary, because this makes it easier to see how everything works together (and it also helps organize everything!).
class Person(object):
    """__init__() functions as the class constructor"""
    def __init__(self, name=None, testScores=None):
        self.name = name
        self.testScores = testScores

# make a list of class Person(s)
personList = []
personList.append(Person("Person 1", [10, 25, 32]))
personList.append(Person("Person 2", [22, 37, 45]))

print("Show one particular item:")
print(personList[0].testScores)
personList[0].testScores.append(50)
print(personList[0].testScores)
print(personList[1].name)
Basically, the Person class is what holds all of the data for an instance of it. If you want to add other types of data, you add a parameter to the __init__() method like this:
def __init__(self, name=None, testScores=None, weight=None):
    self.name = name
    self.testScores = testScores
    self.weight = weight
You can edit the values just like you would a variable.
If this isn't what you are looking for, or you are confused, I am willing to try to help you more.
I agree that using a Person class is a better solution here. It's a more abstract and intuitive way to represent the concept, which will make your code easier to work with.
Check this out:
class Person():
    # Define a custom method for retrieving attributes
    def __getattr__(self, attr_name):
        # If the attribute exists, setdefault will return it.
        # If it doesn't yet exist, it will set it to an empty
        # dictionary, and then return it.
        return self.__dict__.setdefault(attr_name, {})
carolyn = Person()
carolyn.name["value"] = "Carolyn"
carolyn.name["excel_loc"] = "A1"
print(carolyn.name)
# {"value": "Carolyn", "excel_loc": "A1"}
maria = Person()
print(maria.name)
# {}
Then collecting people into a dictionary is easy:
people = {
    "carolyn": carolyn,
    "maria": maria,
}
people["Ralph"] = Person()
people["Ralph"].name["value"] = "Ralph"
You've also made a tricky mistake in defining the add method:
def add(self, list_of_vals=[], attr_name=[]):
    ...
In Python, you never want an empty list as a default argument value. Default values are evaluated once, when the function is defined, so every call that relies on the default shares the same list object instead of getting a new, empty list each time.
Here's a common workaround:
def add(self, list_of_vals=None, attr_name=None):
    list_of_vals = list_of_vals or []
    attr_name = attr_name or []
    ...
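The pitfall is easy to demonstrate; here is a small self-contained sketch (the function names are illustrative):

```python
def bad_add(item, bucket=[]):
    # The default list is created once, at function definition time,
    # and shared by every call that relies on it.
    bucket.append(item)
    return bucket

def good_add(item, bucket=None):
    if bucket is None:
        bucket = []   # a fresh list on every call
    bucket.append(item)
    return bucket

print(bad_add(1))   # [1]
print(bad_add(2))   # [1, 2]  <- state from the first call leaks in
print(good_add(1))  # [1]
print(good_add(2))  # [2]
```

Note that the `or` idiom above also replaces an explicitly passed empty list or other falsy value; the explicit `is None` check in this sketch is slightly stricter.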
I know that inheritance is not the simplest option when working with pandas, but I'm curious how to obtain the result I'm after.
Say I have a function that from a string returns a dictionary (the string could be a path, the name of a collection...):
def str_to_dict(string):
    ...

dic = str_to_dict(s1)
dic
>>> {'col_1': ['a', 'b', ...], 'col2': [1, 2, ...]}
What I want to do is to create a subclass of pandas.DataFrame that would contain the data of dic while being initialized by a string using the method above and retain the string as attribute.
I know that simply passing a dictionary into pandas.DataFrame would work for some cases, but I might need to change the orientation (keys being the index instead of the column names), so I wanted to use the from_dict constructor to get my DataFrame.
Here is my work on it:
# Works but only if you want the keys of the dictionary to be the columns
class MySubClass(pandas.DataFrame):
    def __init__(self, string):
        self.my_string_attribute = string
        dic = str_to_dict(string)
        pandas.DataFrame.__init__(self, dic)
# Does not work, throws a RecursionError:
# __init__ uses the from_dict constructor, which calls __init__ again
class MySubClass(pandas.DataFrame):
    def __init__(self, string):
        self.my_string_attribute = string
        self.from_dict(str_to_dict(string))  # Here I could add any option needed
Once again, I know there are alternatives to inheritance, and I might go with composition to carry on with my project, but I am just curious how it could be made to work.
The reason why what you are trying to do doesn't work is elaborated here:
https://github.com/pandas-dev/pandas/issues/2859
And this won't work because it does not return an instance of your subclass (there is a bunch of issues here):
# Works but only if you want the keys of the dictionary to be the columns
class MySubClass(pandas.DataFrame):
    def __init__(self, string):
        self.my_string_attribute = string
        dic = str_to_dict(string)
        pandas.DataFrame.__init__(self, dic)
So what you can do is add the capability to your pd.DataFrame subclass like this:
import ast
import pandas as pd

def str_to_dict(string):
    return ast.literal_eval(string)

class MySubClass(pd.DataFrame):
    def from_str(self, string):
        df_obj = super().from_dict(str_to_dict(string))
        df_obj.my_string_attribute = string
        return df_obj

data = "{'col_1' : ['a','b'], 'col2': [1, 2]}"
obj = MySubClass().from_str(data)
type(obj)
# __main__.MySubClass
obj.my_string_attribute
# "{'col_1' : ['a','b'], 'col2': [1, 2]}"
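For completeness, the pandas docs on subclassing suggest overriding the _constructor property (so pandas operations return your subclass) and listing custom attributes in _metadata so they are carried through operations. A sketch along those lines, using a classmethod to sidestep the __init__ recursion entirely:

```python
import ast
import pandas as pd

def str_to_dict(string):
    return ast.literal_eval(string)

class MySubClass(pd.DataFrame):
    # Ask pandas to propagate this attribute through operations
    _metadata = ["my_string_attribute"]

    @property
    def _constructor(self):
        # Slicing/copying returns MySubClass instead of a plain DataFrame
        return MySubClass

    @classmethod
    def from_str(cls, string):
        obj = cls(str_to_dict(string))
        obj.my_string_attribute = string
        return obj

data = "{'col_1': ['a', 'b'], 'col2': [1, 2]}"
obj = MySubClass.from_str(data)
print(type(obj).__name__)   # MySubClass
```

With this in place, even a derived frame such as obj.head() remains a MySubClass.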
I know that sounds generic, so I will try my best to explain the case.
I am getting some data back from a process, and I would like to organize this data in a format that I can interact with and access.
So far I thought that I could make a simple, empty class which at creation time takes **kwargs and cycles through them, adding them to the instance.
Although I am not sure if this is the correct way to do so. Imagine the following data:
dict1 = {'param1': 'a', 'param2': 'b', 'param3': 'c', 'param4': 'd'}  # Operation1
dict2 = {'param2': 't', 'param1': 2, 'param1': 'r'}  # Operation2
dict3 = {'param1': 1, 'param7': 2, 'param2': 4, 'param4': 'b', 'param3': 'm'}  # Operation3
I would like to make a class that, when created, takes the parameters and turns each one into an instance attribute, with the attribute name taken from the parameter name and its value from that parameter's value:

myclass1 = MyClass(dict1)

so that myclass1.param1 returns a, myclass1.param2 returns b, and so on. And if I instead build an instance from dict2:

myclass2 = MyClass(dict2)

then myclass2.param2 returns t, myclass2.param1 returns 2, and so on.
This way each attribute is named after its parameter and can be retrieved later, and at the same time I do not have to worry about how many elements my data has, since the class will always take whatever parameters it receives and create instance attributes from the dictionary's keys.
Is it possible to achieve this in a simple way in Python? I could use a dictionary inside a dictionary, but that feels utterly complicated for big data sets.
You can do something like:
In [2]: dict1 = {'param1': 'a', 'param2': 'b', 'param3': 'c', 'param4': 'd'}
In [3]: class A(object):
   ...:     def __init__(self, params):
   ...:         for k, v in params.items():
   ...:             setattr(self, k, v)
   ...:
In [4]: a = A(dict1)
In [5]: a.param1
Out[5]: 'a'
In [6]: a.param2
Out[6]: 'b'
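For what it's worth, on Python 3 the standard library already packages this pattern: types.SimpleNamespace performs the same setattr loop for you.

```python
from types import SimpleNamespace

dict1 = {'param1': 'a', 'param2': 'b', 'param3': 'c', 'param4': 'd'}

# One attribute per key, no custom class needed
a = SimpleNamespace(**dict1)
print(a.param1)  # a
print(a.param2)  # b
```

SimpleNamespace also gives you a readable repr and equality comparison for free, which the hand-rolled class above lacks.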