I have a class that looks something like:
from dataclasses import dataclass

import pandas as pd


@dataclass
class MyClass:
    df1: pd.Series = None
    df2: pd.Series = None
    df3: pd.Series = None
    df4: pd.Series = None
    df5: pd.Series = None

    @property
    def series_mean(self) -> pd.Series:
        series_mean = (
            self.df1
            + self.df2
            + self.df3
            + self.df4
            + self.df5
        ).mean()
        return series_mean
Generally this could just be done as a standalone function, but for this case, let's just assume I do it this way.
Now, my issue is that none of the dfs are mandatory, so I could give it just df1 and df5. In that case, the mean doesn't work because of the Nones in the class.
So how do I go about using only the ones that are not None? And on top of that, is there a way to get how many of them are not None? I mean, so that I could divide by 2 instead of calling .mean() if only two of them are set.
Put them in a tuple instead and filter on None. Then you can easily both sum and query len from the filtered tuple.
from dataclasses import dataclass
from typing import Tuple

import pandas as pd


@dataclass
class MyClass:
    dfs: Tuple[pd.Series, pd.Series, pd.Series, pd.Series, pd.Series] = (None, None, None, None, None)

    @property
    def series_mean(self) -> pd.Series:
        # filter(None, ...) would call bool() on each Series, which raises
        # "truth value of a Series is ambiguous", so test against None explicitly
        not_none = tuple(df for df in self.dfs if df is not None)
        return sum(not_none) / len(not_none)
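For example, a quick usage sketch (the element-wise sum of the two supplied series gets divided by len(not_none), here 2):
import pandas as pd

mc = MyClass(dfs=(pd.Series([1.0, 2.0]), None, None, None, pd.Series([3.0, 4.0])))
print(mc.series_mean)  # Series with values [2.0, 3.0]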
Assuming you actually need to keep this as five separate attributes, which I doubt, you can use sum with a generator expression to filter out the None values:
@property
def series_mean(self) -> pd.Series:
    return sum(
        df for df in
        [self.df1, self.df2, self.df3, self.df4, self.df5]
        if df is not None
    ).mean()
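To answer the counting part of the question, a small sketch of a companion property (n_series is a hypothetical name), relying on the fact that True counts as 1 in a sum:
@property
def n_series(self) -> int:
    # counts how many of the attributes are set
    return sum(df is not None for df in
               (self.df1, self.df2, self.df3, self.df4, self.df5))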
I've been running into a bit of weirdness with Unions (and Optionals, of course) in Python -- namely, it seems that the static type checker tests properties against all members of a union, rather than a single member (i.e. it seems overly strict?). As an example, consider the following:
import pandas as pd
def test_dummy() -> pd.DataFrame:
    df = pd.DataFrame()
    df = df.fillna(df)
    return df
This creates a type warning, since DataFrame.fillna(..., inplace: bool = False, ...) is annotated as returning Optional[pd.DataFrame] (it returns None if inplace=True). I suspect that in theory the static type checker should realize that the return type of the function changes depending on the arguments (as that should be known when the code is written), but that's a bit beside the point.
I have the following questions:
What is the best way to resolve this? I can think of two solutions:
i) do nothing -- which creates ugly squiggles in my code
ii) cast the return of fillna to a pd.DataFrame; my understanding is this is an informative step for the static type checker, so it should not cause any concerns or issues?
Let us consider that I'm writing a function f which, similarly to this, has its return type vary depending on the function call inputs, and this should be determinable before runtime. In order to avoid such errors in the future, what is the best way to go about writing this function? Would it be better to do something like a @typing.overload?
The underlying function should really be defined with overloads -- I'd probably suggest a patch to pandas.
Here's what the type looks like right now:
def fillna(
    self: FrameOrSeries,
    value=None,
    method=None,
    axis=None,
    inplace: bool_t = False,
    limit=None,
    downcast=None,
) -> Optional[FrameOrSeries]: ...
in reality, a better way to represent this is to use an @overload -- the function returns None when inplace=True:
@overload
def fillna(
    self: FrameOrSeries,
    value=None,
    method=None,
    axis=None,
    *,
    # a parameter typed Literal[True] can't default to False, so make it
    # keyword-only and required in this overload
    inplace: Literal[True],
    limit=None,
    downcast=None,
) -> None: ...

@overload
def fillna(
    self: FrameOrSeries,
    value=None,
    method=None,
    axis=None,
    inplace: Literal[False] = False,
    limit=None,
    downcast=None,
) -> FrameOrSeries: ...

def fillna(
    self: FrameOrSeries,
    value=None,
    method=None,
    axis=None,
    inplace: bool_t = False,
    limit=None,
    downcast=None,
) -> Optional[FrameOrSeries]:
    # actual implementation
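With overloads like these, the checker picks the matching return type at each call site. A quick sketch of what it would then infer:
df = pd.DataFrame()

out = df.fillna(0)                # matches the Literal[False] overload: DataFrame
ret = df.fillna(0, inplace=True)  # matches the Literal[True] overload: None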
but assuming you can't change the underlying library, you have several approaches to unpacking the union. I made a video about this specifically for re.match, but I'll reiterate here since it's basically the same problem (Optional[T]).
option 1: an assert indicating the expected return type
the assert tells the type checker something it doesn't know: that the type is narrower than it knows about. mypy will trust this assertion and the type will be assumed to be pd.DataFrame
def test_dummy() -> pd.DataFrame:
    df = pd.DataFrame()
    ret = df.fillna(df)
    assert ret is not None
    return ret
option 2: cast
explicitly tell the type checker that the type is what you expect, "cast"ing away the None-ness
from typing import cast
def test_dummy() -> pd.DataFrame:
    df = pd.DataFrame()
    ret = cast(pd.DataFrame, df.fillna(df))
    return ret
option 3: type: ignore
the (imo) hacky solution is to tell the type checker to ignore the incompatibility, I would not suggest this approach but it can be helpful as a quick fix
def test_dummy() -> pd.DataFrame:
    df = pd.DataFrame()
    ret = df.fillna(df)
    return ret  # type: ignore
The pandas.DataFrame.fillna method is defined as returning either DataFrame or None.
If there is a possibility that a function will return None, then this should be documented by using an Optional type hint. It would be wrong to try to hide the fact that a function could return None by using a cast or a comment to ignore the warning, such as:
return df # type: ignore
If the function could return None, use Optional
import numpy as np
import pandas as pd
from typing import Optional
def test_dummy() -> Optional[pd.DataFrame]:
    df = pd.DataFrame([np.nan, 2, np.nan, 0])
    df = df.fillna(value=0)
    return df
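Callers of this version then have to narrow the Optional themselves, for example:
result = test_dummy()
if result is not None:
    print(result.shape)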
Function guaranteed not to return None
If you can guarantee that a function will not return None, but this cannot be statically inferred by a type checker, then there are three options.
Option 1: Use an assertion to indicate that DataFrame is not None
This is the approach recommended by the mypy documentation.
def test_dummy() -> pd.DataFrame:
    df = pd.DataFrame([np.nan, 2, np.nan, 0])
    df = df.fillna(value=0)
    assert df is not None
    return df
Option 2: Use a cast
from typing import cast
def test_dummy() -> pd.DataFrame:
    df = pd.DataFrame([np.nan, 2, np.nan, 0])
    df = cast(pd.DataFrame, df.fillna(value=0))
    return df
Option 3: Tell mypy to ignore the warning (not recommended)
def test_dummy() -> pd.DataFrame:
    df = pd.DataFrame([np.nan, 2, np.nan, 0])
    df = df.fillna(value=0)
    return df  # type: ignore
This is a general Python question. Is it possible to assign different variables to a class object and then perform different sets of operations on those variables? I'm trying to reduce code, but maybe this isn't how it works. For example, I'm trying to do something like this:
Edit: here is an abstract of the class and methods:
class Class:
    def __init__(self, df):
        self.df = df

    def query(self, query):
        self.df = self.df.query(query)
        return self

    def forward_fill(self, filter):
        self.df.update(self.df.filter(like=filter).mask(lambda x: x == 0).ffill(1))
        return self

    def diff(self, cols=None, axis=1):
        diff = self.df[self.df.columns[~self.df.columns.isin(cols)]].diff(axis=axis)
        self.df = diff.join(self.df[self.df.columns.difference(diff.columns)])
        return self

    def melt(self, cols, var=None, value=None):
        return pd.melt(self.df, id_vars=cols, var_name=var, value_name=value)
I'm trying to use it like this:
df = pd.read_csv('data.csv')
df = Class(df)
df = df.query(query).forward_fill(include)
df_1 = df.diff(cols).melt(cols)
df_2 = df.melt(cols)
df_1 and df_2 should have different values; however, df_2 comes out the same as df_1. This issue is resolved if I use the class like this:
df_1 = pd.read_csv('data.csv')
df_2 = pd.read_csv('data.csv')
df_1 = Class(df_1)
df_2 = Class(df_2)
df_1 = df_1.query(query).forward_fill(include)
df_2 = df_2.query(query).forward_fill(include)
df_1 = df_1.diff(cols).melt(cols)
df_2 = df_2.melt(cols)
This results in extra code. Is there a better way to do this, where you can use an object differently for different variables? Or do I have to create separate objects if I want two variables to undergo separate operations and hold different values?
With the return self statement in the diff method, you return a reference to the same object, and diff has already mutated the instance's df. So by the time melt runs, the original df has changed.
Here:
df = pd.read_csv('data.csv')

df = Class(df)
df = df.query(query).forward_fill(include)

df_1 = df.diff(cols).melt(cols)
the Class instance df already holds the diffed data, because diff mutated it in place before melt reshaped it. Subsequently, df_2 = df.melt(cols) gives the same result as df_2 = df_1.melt(cols).
If you want to work with one object, you should not use self.df = ... in your class methods, because that changes the instance's df. Instead, write a local df = ... and return Class(df).
For example:
def diff(self, cols=None, axis=1):
    diff = self.df[self.df.columns[~self.df.columns.isin(cols)]].diff(axis=axis)
    df = diff.join(self.df[self.df.columns.difference(diff.columns)])
    return Class(df)
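With that change (applied to the other methods as well), each chained call returns a fresh wrapper, so the two pipelines no longer interfere. A sketch, assuming query and forward_fill are rewritten the same way:
base = Class(pd.read_csv('data.csv')).query(query).forward_fill(include)

df_1 = base.diff(cols).melt(cols)
df_2 = base.melt(cols)  # no longer affected by the diff above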
Best regards
I'm running a school assignment matching algorithm using dictionaries. The algorithm itself is relatively efficient, except for the part where I need to export the results to a .csv.
students is a dictionary with 483,070 key-value pairs. The key is an integer id, and the value is a Student class object that I create. Currently, to export the results, I'm using the following methods.
def parse_student_match_information(student: Student) -> int:
    if student.assigned_vacancy is None:
        return 0
    return student.assigned_vacancy.program_id

def get_assignation_output(students: dict) -> pd.DataFrame:
    result = pd.DataFrame(columns=['Student_ID', 'Program_ID', 'Grade_ID'])
    for student in students.values():
        program_id = parse_student_match_information(student)
        result = result.append({'Student_ID': student.id, 'Program_ID': program_id, 'Grade_ID': student.grade}, ignore_index=True)
    return result.sort_values('Grade_ID')
It took more than an hour to produce this pd.DataFrame. Any suggestions are welcome!
Generally you don't want to append to a DataFrame row by row; instead, create it from an iterable all at once. A better way is shown below.
from typing import Iterable

def parse_student_match_information(student: Student) -> int:
    if student.assigned_vacancy is None:
        return 0
    return student.assigned_vacancy.program_id

def get_assignation_output(students: dict) -> Iterable[dict]:
    for student in students.values():
        program_id = parse_student_match_information(student)
        result = {'Student_ID': student.id, 'Program_ID': program_id, 'Grade_ID': student.grade}
        yield result

def make_df(rows: Iterable[dict]) -> pd.DataFrame:
    df = pd.DataFrame(rows, columns=['Student_ID', 'Program_ID', 'Grade_ID'])
    # sort_values returns a new frame, so reassign rather than discard the result
    df = df.sort_values(by=['Grade_ID'])
    return df
This way you create the DataFrame from all the rows at once and then sort it once at the very end as opposed to each iteration. You should see improvements in terms of performance from this.
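Putting it together, a minimal sketch (the output filename is just an example):
rows = get_assignation_output(students)
df = make_df(rows)
df.to_csv('assignment_results.csv', index=False)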
I have a function defined to slice a dataframe and do some analysis, like this:
def df_slice(startrow, endrow):
    # do something...
    newdf = df[startrow, endrow]
    # do something...
    return newdf
Normally, to analyze the first few rows of a df, I can just use
df1= df_slice(0,10)
But what if I wish to slice the last 5 rows of the dataframe?
so that in the function
newdf = df[-5:]
I can't use df1 = df_slice(-5, '') or just leave it blank like df1 = df_slice(-5,).
What should I do?
Found the answer: just pass None as the parameter.
df[startrow, endrow]
is equivalent to
df.__getitem__((startrow, endrow))
And
df[startrow:]
is equivalent to
df.__getitem__(slice(startrow, None))
Here is some sample code:
class MyCollection:
    def __getitem__(self, item):
        return item

my_collection = MyCollection()
print(my_collection[1, 2])
print(my_collection[1:])
output:
(1, 2)
slice(1, None, None) # start, stop, step
As you noticed, an omitted slice element means None.
So you can call
newdf = df_slice(-5, None)
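So inside df_slice, a colon-based slice (rather than the comma form from the question) handles both cases. A sketch, assuming df is the module-level frame from the question:
def df_slice(startrow, endrow):
    # an endrow of None behaves like an open-ended slice
    return df[startrow:endrow]

last_five = df_slice(-5, None)  # same as df[-5:]
first_ten = df_slice(0, 10)     # same as df[0:10]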
reference: https://docs.python.org/3/library/functions.html#slice
I have a function that takes in some complex parameters and is expected to return a filter to be used on a pandas dataframe.
filters = build_filters(df, ...)
filtered_df = df[filters]
For example, if the dataframe has series Gender and Age, build_filters could return (df.Gender == 'M') & (df.Age == 100)
If, however, build_filters determines that there should be no filters applied, is there anything that I can return (i.e. the "identity filter") that will result in df not being filtered?
I've tried the obvious things like None, True, and even a generator that returns True for every call to next()
The closest I've come is
operator.ne(df.ix[:,0], nan)
which I think is silly, and likely going to cause bugs I can't yet foresee.
You can return slice(None). Here's a trivial demonstration:
df = pd.DataFrame([[1, 2, 3]])
df2 = df[slice(None)] # equivalent to df2 = df[:]
df2[0] = -1
assert df.equals(df2)
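So a build_filters implementation can fall back to slice(None) as its "identity filter". A sketch, with the real decision logic replaced by a placeholder flag:
def build_filters(df, want_filters=True):
    # want_filters stands in for the actual decision logic
    if want_filters:
        return (df.Gender == 'M') & (df.Age == 100)
    return slice(None)  # df[slice(None)] keeps every row

filtered_df = df[build_filters(df)]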
Alternatively, use pd.DataFrame.pipe and return df if no filters need to be applied:
def apply_filters(df):
    # some logic
    if not filter_flag:
        return df
    else:
        # mask = ...
        return df[mask]

filtered_df = df.pipe(apply_filters)