How to overwrite the object representation of the index in pandas - python

I am using Enums as keys in pandas. Below is a small example of a dataframe which will be converted to JSON.
[IN]
# coding=utf-8
# Written in python 3.7
# pandas==0.23.4
from enum import unique, Enum
import pandas as pd
@unique
class DEMO(Enum):
    FIRST = "hello"
    SECOND = "world"

df = pd.DataFrame()
df[DEMO.FIRST] = pd.Series([1, 2])
df[DEMO.SECOND] = pd.Series([1, 2])
print(df.to_json())
[OUT]
{"{"name":"FIRST"}":{"0":1,"1":2},"{"name":"SECOND"}":{"0":1,"1":2}}
What I would like is for the Enum not to be represented as an object (via its __dir__(self) definition), but instead as a string containing its value, equivalent to using string constants:
[OUT]
{"hello":{"0":1,"1":2},"world":{"0":1,"1":2}}
Is this possible without using DEMO.FIRST.value or DEMO.SECOND.value as indices?

You need the value attribute of the Enums. One possibility is to use a lambda with df.rename:
df.rename(lambda x: x.value, axis=1, copy=False).to_json()
# Out '{"hello":{"0":1,"1":2},"world":{"0":1,"1":2}}'
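Another equivalent one-off fix, if you don't need a renamed copy, is to overwrite the column labels in place with the Enum values before serializing. A minimal sketch:

```python
from enum import Enum, unique
import pandas as pd

@unique
class DEMO(Enum):
    FIRST = "hello"
    SECOND = "world"

df = pd.DataFrame()
df[DEMO.FIRST] = pd.Series([1, 2])
df[DEMO.SECOND] = pd.Series([1, 2])

# Replace the Enum column labels with their underlying string values
df.columns = [c.value for c in df.columns]
print(df.to_json())  # {"hello":{"0":1,"1":2},"world":{"0":1,"1":2}}
```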

I found another solution which works pretty well even if the enum is more complex or consists of multiple datatypes.
# coding=utf-8
# Written in python 3.7
# pandas==0.23.4
from enum import unique, Enum
import pandas as pd

class Complex:
    name: str
    type: str

    def __init__(self, name: str, type: str):
        self.name = name
        self.type = type

    def __str__(self) -> str:
        return self.name

@unique
class DEMO(str, Enum):
    FIRST = Complex("Hello", "Siebzig")
    SECOND = Complex("World", "Zehn")

df = pd.DataFrame()
df[DEMO.FIRST] = pd.Series([1, 2])
df[DEMO.SECOND] = pd.Series([1, 2])
print(df.to_json())
will produce the output
{"Hello":{"0":1,"1":2},"World":{"0":1,"1":2}}
The important change was adding str as a base class before Enum.
This is even pretty simple to use with dynamically typed Enum contents, as long as they have a string representation (def __str__(self) -> str:). The Enum class will automatically check the string serialization of all members for uniqueness, without the need to override __hash__(self).
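A minimal sketch of why the str mixin helps (class and member names below are illustrative): because each member is itself a real str, JSON-style serializers use its string content directly as the key instead of falling back to an object representation.

```python
import json
from enum import Enum

class Color(str, Enum):
    RED = "red"
    BLUE = "blue"

# Each member IS a str, so json.dumps emits its string content as the key
print(json.dumps({Color.RED: 1, Color.BLUE: 2}))  # {"red": 1, "blue": 2}
```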

Related

Python - Dataclass: load attribute value from a dictionary containing an invalid name

Unfortunately I have to load a dictionary containing an invalid name (which I can't change):
dict = {..., "invalid-name": 0, ...}
I would like to cast this dictionary into a dataclass object, but I can't define an attribute with this name.
from dataclasses import dataclass

@dataclass
class Dict:
    ...
    invalid-name: int  # can't do this
    ...
The only solution I could find is to change the dictionary key into a valid one right before casting it into a dataclass object:
dict["valid_name"] = dict.pop("invalid-name")
But I would like to avoid using string literals...
Is there any better solution to this?
One solution would be using dict-to-dataclass. As mentioned in its documents it has two options:
1. Passing dictionary keys
It's probably quite common that your dataclass fields have the same names as the dictionary keys they map to but in case they don't, you can pass the dictionary key as the first argument (or the dict_key keyword argument) to field_from_dict:
@dataclass
class MyDataclass(DataclassFromDict):
    name_in_dataclass: str = field_from_dict("nameInDictionary")

origin_dict = {
    "nameInDictionary": "field value"
}

dataclass_instance = MyDataclass.from_dict(origin_dict)
>>> dataclass_instance.name_in_dataclass
"field value"
2. Custom converters
If you need to convert a dictionary value that isn't covered by the defaults, you can pass in a converter function using field_from_dict's converter parameter:
def yes_no_to_bool(yes_no: str) -> bool:
    return yes_no == "yes"

@dataclass
class MyDataclass(DataclassFromDict):
    is_yes: bool = field_from_dict(converter=yes_no_to_bool)

dataclass_instance = MyDataclass.from_dict({"is_yes": "yes"})
>>> dataclass_instance.is_yes
True
The following code allows filtering out keys that are not dataclass fields:
import dataclasses

@dataclasses.dataclass
class ClassDict:
    valid_name0: str
    valid_name1: int
    ...

dict = {..., "invalid-name": 0, ...}
dict = {k: v for k, v in dict.items() if k in tuple(f.name for f in dataclasses.fields(ClassDict))}
However, I'm sure there should be a better way to do it since this is a bit hacky.
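A runnable version of that filtering idea (field and key names below are illustrative):

```python
import dataclasses

@dataclasses.dataclass
class ClassDict:
    valid_name0: str
    valid_name1: int

raw = {"valid_name0": "x", "valid_name1": 1, "invalid-name": 0}

# Keep only the keys that correspond to actual dataclass fields
field_names = {f.name for f in dataclasses.fields(ClassDict)}
filtered = {k: v for k, v in raw.items() if k in field_names}

obj = ClassDict(**filtered)
print(obj)  # ClassDict(valid_name0='x', valid_name1=1)
```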
I would define a from_dict class method anyway, which would be a natural place to make the change.
@dataclass
class MyDict:
    ...
    valid_name: int
    ...

    @classmethod
    def from_dict(cls, d):
        d['valid_name'] = d.pop('invalid-name')
        return cls(**d)

md = MyDict.from_dict({'invalid-name': 3, ...})
Whether you should modify d in place or do something to avoid unnecessary copies is another matter.
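For the non-mutating option, a sketch of the same from_dict with a shallow copy (field names as above):

```python
from dataclasses import dataclass

@dataclass
class MyDict:
    valid_name: int

    @classmethod
    def from_dict(cls, d):
        d = dict(d)  # shallow copy so the caller's dict is untouched
        d['valid_name'] = d.pop('invalid-name')
        return cls(**d)

source = {'invalid-name': 3}
md = MyDict.from_dict(source)
print(md.valid_name)  # 3
print(source)         # {'invalid-name': 3} -- unchanged
```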
Another option could be to use the dataclass-wizard library, which is likewise a de/serialization library built on top of dataclasses. It should similarly support custom key mappings, as needed in this case.
I've also timed it with the builtin timeit module, and found it to be (on average) about 5x faster than a solution with dict_to_dataclass. I've added the code I used for comparison below.
from dataclasses import dataclass
from timeit import timeit

from typing_extensions import Annotated  # Note: in Python 3.9+, can import this from `typing` instead

from dataclass_wizard import JSONWizard, json_key
from dict_to_dataclass import DataclassFromDict, field_from_dict

@dataclass
class ClassDictWiz(JSONWizard):
    valid_name: Annotated[int, json_key('invalid-name')]

@dataclass
class ClassDict(DataclassFromDict):
    valid_name: int = field_from_dict('invalid-name')

my_dict = {"invalid-name": 0}

n = 100_000

print('dict-to-dataclass: ', round(timeit('ClassDict.from_dict(my_dict)', globals=globals(), number=n), 3))
print('dataclass-wizard: ', round(timeit('ClassDictWiz.from_dict(my_dict)', globals=globals(), number=n), 3))

i1, i2 = ClassDict.from_dict(my_dict), ClassDictWiz.from_dict(my_dict)

# assert we get the same result with both approaches
assert i1.__dict__ == i2.__dict__
Results, on my Mac OS X laptop:
dict-to-dataclass: 0.594
dataclass-wizard: 0.098

Make a Union of strings to be used as possible dictionary keys

I have some Python 3.7 code and I am trying to add types to it. One of the types I want to add is actually a Union of several possible strings:
from typing import Union, Optional, Dict

PossibleKey = Union["fruits", "cars", "vegetables"]
PossibleType = Dict[PossibleKey, str]

def some_function(target: Optional[PossibleType] = None):
    if target:
        all_fruits = target["fruits"]
        print(f"I have {all_fruits}")
The problem here is that Pyright complains about PossibleKey. It says:
"fruits is not defined"
I would like to get Pyright/Pylance to work.
I have checked the from enum import Enum module from another SO answer, but if I try that I end up with more issues since I am actually dealing with a Dict[str, Any] and not an Enum.
What is the proper Pythonic way of representing my type?
"fruits" is not a type (hint), but Literal["fruits"] is.
from typing import Union, Literal

PossibleKey = Union[Literal["fruits"], Literal["cars"], Literal["vegetables"]]

or the much shorter version,

PossibleKey = Literal["fruits", "cars", "vegetables"]

(Note that Literal was added to typing in Python 3.8; on Python 3.7 you can import it from typing_extensions instead.)
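As a side note, the allowed strings can be recovered from the Literal at runtime with typing.get_args, which keeps the checked type and any runtime validation in sync. A sketch (the check_key helper is made up for illustration):

```python
from typing import Literal, get_args

PossibleKey = Literal["fruits", "cars", "vegetables"]

# get_args returns the literal values as a tuple, usable for runtime checks
ALLOWED_KEYS = set(get_args(PossibleKey))

def check_key(key: str) -> bool:
    return key in ALLOWED_KEYS

print(check_key("fruits"))  # True
print(check_key("planes"))  # False
```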
Or, as you mentioned, define an Enum populated by the three values.
from enum import Enum

class Key(Enum):
    Fruits = "fruits"
    Cars = "cars"
    Vegetables = "vegetables"

def some_function(target: Optional[PossibleType] = None):
    if target:
        all_fruits = target[Key.Fruits]
        print(f"I have {all_fruits}")
(However, just because target is not None doesn't necessarily mean it actually has "fruits" as a key, only that it has no key other than Key.Fruits, Key.Cars, or Key.Vegetables.)
Pyright error disappears if you define PossibleKey as Enum as below.
This requires only one line change to the original code.
If there is some issue with using Enum, please elaborate on that.
from typing import Union, Optional, Dict
from enum import Enum

PossibleKey = Enum("PossibleKey", ["fruits", "cars", "vegetables"])
PossibleType = Dict[PossibleKey, str]

def some_function(target: Optional[PossibleType] = None):
    if target:
        all_fruits = target["fruits"]
        print(f"I have {all_fruits}")

How can I instantiate a new dataclass instance in Python without supplying parameters?

I want to create a data class instance and supply values later.
How can I do this?
def create_trade_data():
    trades = []
    td = TradeData()
    td.Symbol = 'New'
    trades.append(td)
    return trades
DataClass:
from dataclasses import dataclass

@dataclass
class TradeData:
    Symbol: str
    ExecPrice: float
You have to make the attributes optional by giving them a default value, such as None:
from dataclasses import dataclass

@dataclass
class TradeData:
    Symbol: str = None
    ExecPrice: float = None
Then your create_trade_data function would return
[TradeData(Symbol='New', ExecPrice=None)]
Now, I chose None as the default value to indicate a lack of content. Of course, you could choose more sensible defaults like in the other answer.
from dataclasses import dataclass

@dataclass
class TradeData:
    Symbol: str = ''
    ExecPrice: float = 0.0
With the = operator you can assign default values. For mutable defaults, like list, there is the dataclasses.field function with its default_factory parameter.
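For example, a mutable default like a list must go through dataclasses.field with default_factory, so each instance gets its own list (the Fills field below is made up for illustration):

```python
from dataclasses import dataclass, field

@dataclass
class TradeData:
    Symbol: str = ''
    ExecPrice: float = 0.0
    Fills: list = field(default_factory=list)  # fresh list per instance

a = TradeData()
b = TradeData()
a.Fills.append(101.5)
print(b.Fills)  # [] -- instances don't share the list
```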

Accessing python namedtuple _fields from other modules

I want to be able to get the length of the _fields member of a namedtuple from another module. However, it is flagged as protected.
The workaround I have is as follows:
MyTuple = namedtuple(
    'MyTuple',
    'a b'
)
"""MyTuple description

Attributes:
    a (float): A descrip
    b (float): B descrip
"""
NUM_MY_TUPLE_FIELDS = len(MyTuple._fields)
Then I import NUM_MY_TUPLE_FIELDS from the external module.
I was trying to find a way to make the functionality part of the class, such as to extend the namedtuple with a __len__ method. Is there a more pythonic way to get the number of fields in a namedtuple from an external module?
Updated to show the autodoc comments. The protected warning is seen in PyCharm. Originally, in the external module I simply imported MyTuple, then used:
x = len(MyTuple._fields)
I tried the following suggestion and thought it was going to work, but I get the following: TypeError: object of type 'type' has no len().
class MyTuple(typing.NamedTuple):
    a: float
    b: float
    """MyTuple doc

    Attributes:
        a (float): A doc
        b (float): B doc
    """

    def __len__(self) -> int:
        return len(self._fields)

fmt_str = f"<L {len(MyTuple)}f"  # for struct.pack usage
print(fmt_str)
You can use inheritance:

class MyTuple(namedtuple('MyTuple', 'a b c d e f')):
    """MyTuple description

    Attributes:
        a (float): A description
        ...
    """

    @property
    def fields(self):
        # _fields is a class level attribute and available via
        # MyTuple._fields from external modules
        return self._fields

    def __len__(self):
        # your implementation if you need it
        return len(self._fields)
or use typing.NamedTuple if you are using Python 3.6+:

class MyTuple(typing.NamedTuple):
    a: int
    # other fields
One way is to use inspect.signature and just count how many parameters the __new__ method requires:
import inspect
n_fields = len(inspect.signature(NTClass).parameters)
This works because typing.NamedTuple disallows overriding the __new__ method, and that is unlikely to change due to the way it is implemented:
>>> import inspect
>>> from typing import NamedTuple
>>> class NTClass(NamedTuple):
... x: int
... y: float
...
>>> len(inspect.signature(NTClass).parameters)
2
It also works for the old collections.namedtuple:
>>> from collections import namedtuple
>>> NTClass = namedtuple("NTClass", "x y")
>>> len(inspect.signature(NTClass).parameters)
2
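Worth noting: despite the leading underscore, _fields is documented public API for namedtuples; the underscore only exists to avoid collisions with user-defined field names, so the "protected member" warning can be safely suppressed. A minimal check:

```python
from collections import namedtuple

MyTuple = namedtuple('MyTuple', 'a b')

# _fields is documented, stable namedtuple API; the underscore just
# avoids clashing with field names like "fields"
NUM_MY_TUPLE_FIELDS = len(MyTuple._fields)
print(NUM_MY_TUPLE_FIELDS)  # 2
```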

Use from_dict() to initialize a subclass of pandas DataFrame

I know that inheritance is not the simplest alternative when using pandas, but I'm curious as to how to obtain the result I wish for.
Say I have a function that from a string returns a dictionary (the string could be a path, the name of a collection...):
def str_to_dict(string):
    ...

dic = str_to_dict(s1)
dic
>>> {'col_1' : ['a','b',...], 'col2': [1, 2, ...]}
What I want to do is to create a subclass of pandas.DataFrame that would contain the data of dic while being initialized by a string using the method above and retain the string as attribute.
I know that simply passing a dictionary into pandas.DataFrame would work for some cases, but I might need to change the orientation (keys being the index instead of the columns names), so I wanted to use the from_dict constructor to get my DataFrame.
Here is my work on it:
# Works but only if you want the keys of the dictionary to be the columns
class MySubClass(pandas.DataFrame):
    def __init__(self, string):
        self.my_string_attribute = string
        dic = str_to_dict(string)
        pandas.DataFrame.__init__(self, dic)

# Does not work, throws a RecursionError
# It is because __init__ is used with the from_dict constructor and calls itself
class MySubClass(pandas.DataFrame):
    def __init__(self, string):
        self.my_string_attribute = string
        self.from_dict(str_to_dict(string))  # Here I could add any option needed
Once again, I know there are alternatives to inheritance and I might go with composition to carry on with my project, but I am just curious about how it could be made to work.
The reason why what you are trying to do doesn't work is elaborated here:
https://github.com/pandas-dev/pandas/issues/2859
And this won't work because it does not return an instance of your
subclass. (Bunch of issues here):
# Works but only if you want the keys of the dictionary to be the columns
class MySubClass(pandas.DataFrame):
    def __init__(self, string):
        self.my_string_attribute = string
        dic = str_to_dict(string)
        pandas.DataFrame.__init__(self, dic)
So what you can do is add capabilities to pd.DataFrame class like this:
import ast
import pandas as pd

def str_to_dict(string):
    return ast.literal_eval(string)

class MySubClass(pd.DataFrame):
    def from_str(self, string):
        df_obj = super().from_dict(str_to_dict(string))
        df_obj.my_string_attribute = string
        return df_obj

data = "{'col_1' : ['a','b'], 'col2': [1, 2]}"
obj = MySubClass().from_str(data)

type(obj)
# __main__.MySubClass
obj.my_string_attribute
# "{'col_1' : ['a','b'], 'col2': [1, 2]}"
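If the subclass should also survive later pandas operations (slices keep the subclass type and the attribute travels with them), pandas' documented subclassing hooks _constructor and _metadata can be combined with this approach. A sketch, using ast.literal_eval as a stand-in for str_to_dict and a from_str classmethod (names are illustrative):

```python
import ast
import pandas as pd

class StrFrame(pd.DataFrame):
    # Tell pandas to propagate this attribute onto derived frames
    _metadata = ["my_string_attribute"]

    @property
    def _constructor(self):
        # Operations that build new frames return StrFrame, not DataFrame
        return StrFrame

    @classmethod
    def from_str(cls, string, orient="columns"):
        # orient="index" would make the dict keys the index instead
        obj = cls(pd.DataFrame.from_dict(ast.literal_eval(string), orient=orient))
        obj.my_string_attribute = string
        return obj

obj = StrFrame.from_str("{'col_1': ['a', 'b'], 'col2': [1, 2]}")
print(type(obj).__name__)       # StrFrame
print(obj.my_string_attribute)
```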
