calling apply() on an empty pandas DataFrame - python

I'm having a problem with the apply() method of the pandas DataFrame. My issue is that apply() can return either a Series or a DataFrame, depending on the return type of the input function; however, when the frame is empty, apply() (almost) always returns a DataFrame. So I can't write code that expects a Series. Here's an example:
import pandas as pd
def area_from_row(row):
    return row['width'] * row['height']
def add_area_column(frame):
    # I know I can multiply the columns directly, but my actual function is
    # more complicated.
    frame['area'] = frame.apply(area_from_row, axis=1)
# This works as expected.
non_empty_frame = pd.DataFrame(data=[[2, 3]], columns=['width', 'height'])
add_area_column(non_empty_frame)
# This fails!
empty_frame = pd.DataFrame(data=None, columns=['width', 'height'])
add_area_column(empty_frame)
Is there a standard way of dealing with this? I can do the following, but it's silly:
def area_from_row(row):
    # The way we respond to an empty row tells pandas whether we're a
    # reduction or not.
    if not len(row):
        return None
    return row['width'] * row['height']
(I'm using pandas 0.11.0, but I checked this on 0.12.0-1100-g0c30665 as well.)

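You can set the result_type parameter of apply to 'reduce'. Note that this parameter was added in pandas 0.23, so it is not available in the 0.11/0.12 versions mentioned in the question.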
From the documentation,
By default (result_type=None), the final return type is inferred from the return type of the applied function. Otherwise, it depends on the result_type argument.
And then,
‘reduce’ : returns a Series if possible rather than expanding list-like results. This is the opposite of ‘expand’.
In your code, update here:
def add_area_column(frame):
    # I know I can multiply the columns directly, but my actual function is
    # more complicated.
    frame['area'] = frame.apply(area_from_row, axis=1, result_type='reduce')
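A minimal sketch of how this behaves on an empty frame, assuming a pandas version that supports result_type; the frame and function names are the ones from the question:

```python
import pandas as pd

def area_from_row(row):
    return row['width'] * row['height']

empty_frame = pd.DataFrame(data=None, columns=['width', 'height'])

# Ask apply() for a reduced (Series) result even though there are no rows.
result = empty_frame.apply(area_from_row, axis=1, result_type='reduce')
print(type(result).__name__, len(result))

# Assigning the empty Series back creates the new, empty column.
empty_frame['area'] = result
print(list(empty_frame.columns))
```

With the Series guaranteed, add_area_column can treat empty and non-empty frames the same way.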

Related

How to use apply function in a dataframe

I have a DataFrame like this, and I am trying to check the correlation between the columns using an isbigger function that checks the values.
df = pd.read_csv('2020.csv')
def isbigger(x):
    if x > 7:
        return True
    return False
It works fine when I pass in a single column, like
df['Ladder score'].apply(isbigger)
However, when I apply it to the whole DataFrame and then compute the correlation, it does not work:
df.drop(axis=1,columns = ['Country name','Regional indicator']).apply(isbigger).corr(method = 'spearman')
I even dropped the string columns, but it still does not work. How can I apply a function to an entire DataFrame?
You don't need apply, just use the usual boolean operators here:
df.drop(['Country name', 'Regional indicator'], axis=1).gt(7).corr()
Series.apply() invokes a Python function on each value of a Series; DataFrame.apply(), by contrast, passes a whole column (or row) to the function.
Syntax:
Series.apply(self, func, convert_dtype=True, args=(), **kwds)
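To illustrate why the DataFrame call fails, here is a small sketch with made-up data (the column names and values are hypothetical, not taken from 2020.csv). DataFrame.apply hands each whole column to the function as a Series, so `if x > 7` has no single truth value, whereas a vectorized comparison works element by element:

```python
import pandas as pd

def isbigger(x):
    if x > 7:
        return True
    return False

# Hypothetical numeric columns standing in for the real data.
scores = pd.DataFrame({'ladder': [7.8, 6.9, 7.2], 'gdp': [10.6, 7.1, 9.5]})

# Series.apply calls isbigger once per value, so this works.
per_value = scores['ladder'].apply(isbigger)

# DataFrame.apply calls isbigger once per column (a Series), so this raises.
try:
    scores.apply(isbigger)
    failed = False
except ValueError as exc:
    failed = True
    print('DataFrame.apply failed:', exc)

# The vectorized comparison produces a boolean frame directly.
flags = scores.gt(7)
print(flags)
```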

AssertionError when comparing pd DataFrame

I'm developing a test for a function that I created. My function returns a pandas DataFrame, and my test consists of comparing it with a stored CSV file. I'm using the following script to do so. When I run it, I get an AssertionError with no other message.
rates_over = get_rates_over(args)
gabarito = pd.read_csv(f'{ROOT_DIR}/data/static/rates_over_teste.csv', parse_dates=['date'])
assert rates_over.equals(gabarito)
But I suspected that my function was good, so I did the following and it didn't print anything, showing that my intuition was right. What is happening?
for index, row in gabarito.iterrows():
    if not row.equals(rates_over.iloc[index]):
        print('Not equal!')
EDIT: As suggested by @gallen, here is a print of the type and head of both gabarito and rates_over.
A DataFrame is never equal to a Series.
pd.DataFrame.equals
This function allows two Series or DataFrames to be compared against each other to see if they have the same shape and elements.
It is meant to compare a DataFrame with a DataFrame, or a Series with a Series, not a mixture of a Series with a DataFrame.
A Series and a DataFrame have entirely different dimensionality.
import pandas as pd
df = pd.DataFrame({'foo': [1,2,3]})
s = df['foo']
print(df.shape)
#(3, 1)
print(s.shape)
#(3,)
The first check in the equals method is to check the dimensionality, so it quickly returns False without ever checking the data.
def equals(self, other):
    self_axes, other_axes = self.axes, other.axes
    if len(self_axes) != len(other_axes):
        return False
    #...
len(s._data.axes)
#1
len(df._data.axes)
#2
If you are certain your DataFrame only has a single column, then you can squeeze it before comparing with your Series.
df.squeeze().equals(s)
#True
Alternatively, convert your Series to a DataFrame using the Series name.
df.equals(s.to_frame(s.name))
#True
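As a sketch of how this could be used in a test, pandas.testing.assert_frame_equal reports what differs instead of raising a bare AssertionError. The frames below are hypothetical stand-ins for rates_over and gabarito, not the asker's data:

```python
import pandas as pd
from pandas.testing import assert_frame_equal

# Hypothetical expected result, and the same data as a Series.
expected = pd.DataFrame({'rate': [0.1, 0.2, 0.3]})
actual_series = expected['rate']

# Mixing a DataFrame with a Series is never equal; converting first is.
print(expected.equals(actual_series))
print(expected.equals(actual_series.to_frame()))

# A frame with one changed value, to show the descriptive failure message.
changed = pd.DataFrame({'rate': [0.1, 0.2, 0.4]})
try:
    assert_frame_equal(changed, expected)
    message = ''
except AssertionError as exc:
    message = str(exc)
print(message)
```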

Using the apply function to pandas dataframe with arguments

I created a function to take a column of a string datatype and ensure the first item in the string is always capitalized. Here is my function:
def myfunc(df, col):
    transformed_df = df[col][0].capitalize() + df[col][1:]
    return transformed_df
Using my function in my column of interest in my pandas dataframe:
df["mycol"].apply(myfunc)
I don't know why I get this error: TypeError: myfunc() missing 1 required positional argument: 'col'
The error persists even when I add axis to indicate that the function should be applied column-wise. I believe I am already passing my arguments, so why do I still need to specify col again? Correct me if I am wrong.
Your input is highly appreciated
When you use Series.apply, each value of the Series is processed separately, so the function needs to take a single value:
def myfunc(val):
    return val[0].capitalize() + val[1:]
If you want to use the pandas string methods instead:
df["mycol"].str[0].str.capitalize() + df["mycol"].str[1:]
If you want to pass the whole column to a function:
def myfunc(col):
    return col.str[0].str.capitalize() + col.str[1:]
Then use Series.pipe to process the Series:
df["mycol"].pipe(myfunc)
Or:
myfunc(df["mycol"])
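The two approaches above can be compared side by side in a short sketch. The sample values and the function names capitalize_first and capitalize_first_column are made up for illustration, and the code assumes every value is a non-empty string with no missing entries (val[0] would fail on an empty string):

```python
import pandas as pd

# Hypothetical sample column.
names = pd.Series(['alice SMITH', 'bob', 'carol jones'], name='mycol')

def capitalize_first(val):
    # Receives one string at a time when used with Series.apply.
    return val[0].capitalize() + val[1:]

def capitalize_first_column(col):
    # Receives the whole Series when used with Series.pipe.
    return col.str[0].str.capitalize() + col.str[1:]

by_value = names.apply(capitalize_first)
by_column = names.pipe(capitalize_first_column)
print(by_value.tolist())
print(by_column.tolist())
```

Only the first character changes; the rest of each string keeps its original case, which is why str.capitalize is not called on the whole value.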

How to compute hash of all the columns in Pandas Dataframe?

df.apply is a method that can apply a certain function to all the columns in a dataframe, or the required columns. However, my aim is to compute the hash of a string: this string is the concatenation of all the values in a row corresponding to all the columns. My current code is returning NaN.
The current code is:
df["row_hash"] = df["row_hash"].apply(self.hash_string)
The function self.hash_string is:
def hash_string(self, value):
    return (sha1(str(value).encode('utf-8')).hexdigest())
Yes, it would be easier to merge all columns of Pandas dataframe but current answer couldn't help me either.
The file that I am reading is (the first 10 rows):
16012,16013,16014,16015,16016,16017,16018,16019,16020,16021,16022
16013,16014,16015,16016,16017,16018,16019,16020,16021,16022,16023
16014,16015,16016,16017,16018,16019,16020,16021,16022,16023,16024
16015,16016,16017,16018,16019,16020,16021,16022,16023,16024,16025
16016,16017,16018,16019,16020,16021,16022,16023,16024,16025,16026
The col names are: col_test_1, col_test_2, .... , col_test_11
You can create a new column, which is concatenation of all others:
df['new'] = df.astype(str).values.sum(axis=1)
And then apply your hash function on it
df["row_hash"] = df["new"].apply(self.hash_string)
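or this one-liner should work (the NumPy array returned by .values has no apply method, so it is wrapped in a Series first):
df["row_hash"] = pd.Series(df.astype(str).values.sum(axis=1), index=df.index).apply(hash_string)
However, I'm not sure you need a separate function here, so:
df["row_hash"] = pd.Series(df.astype(str).values.sum(axis=1), index=df.index).apply(lambda x: sha1(str(x).encode('utf-8')).hexdigest())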
You can use apply twice, first on the row elements then on the result:
df.apply(lambda x: ''.join(x.astype(str)),axis=1).apply(self.hash_string)
Sidenote: I don't understand why you are defining hash_string as an instance method (instead of a plain function), since it doesn't use the self argument. If that causes problems, you can pass the hashing logic as a plain function instead:
df.apply(lambda x: ''.join(x.astype(str)),axis=1).apply(lambda value: sha1(str(value).encode('utf-8')).hexdigest())
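A self-contained sketch of the two-step approach, using only the first three columns of the first two sample rows and a plain hash_string function (hashlib.sha1 is assumed to be the sha1 imported in the question):

```python
import hashlib
import pandas as pd

def hash_string(value):
    return hashlib.sha1(str(value).encode('utf-8')).hexdigest()

# A cut-down version of the sample data from the question.
df = pd.DataFrame(
    [[16012, 16013, 16014], [16013, 16014, 16015]],
    columns=['col_test_1', 'col_test_2', 'col_test_3'],
)

# Step 1: concatenate every value of a row into one string.
row_strings = df.astype(str).apply(''.join, axis=1)

# Step 2: hash each concatenated string and store it in a new column.
df['row_hash'] = row_strings.apply(hash_string)
print(df['row_hash'].tolist())
```

Note that joining without a separator can map different rows to the same string (for example, the values 1, 23 and 12, 3 both become "123"). Joining with a delimiter such as '|' would avoid that; this is a suggestion beyond the original answers.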

Pandas: apply a function with columns and a variable as argument

I'm trying to apply to a dataframe a function that has more than one argument, of which two need to be assigned to the dataframe's rows, and one is a variable (a simple number).
A variation from a similar thread works for the rows: (all functions are oversimplified compared to my original ones)
import pandas as pd
data = {'a': [-2, 5, 4, -6], 'b': [4, 4, 5, -8]}
df = pd.DataFrame(data)
print(df)
def DummyFunction(row):
    return row['a'] * row['b']
# this works:
df['Dummy1'] = df.apply(DummyFunction, axis=1)
But how can I apply the following variation, where my function takes in an additional argument (a fixed variable)? I seem to find no way to pass it inside the apply method:
def DummyFunction2(row, threshold):
    return row['a'] * row['b'] * threshold
# where threshold will be assigned to a number?
# I don't seem to find a viable option to fill the row argument below:
# df['Dummy2'] = df.apply(DummyFunction2(row, 1000), axis=1)
Thanks for your help!
You can pass the additional variable directly as a named argument to pd.DataFrame.apply:
def DummyFunction2(row, threshold):
    return row['a'] * row['b'] * threshold
df['Dummy2'] = df.apply(DummyFunction2, threshold=2, axis=1)
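For comparison, a short sketch of three ways to supply the extra argument, using the data from the question. The keyword form is the one shown above; the args tuple and the lambda wrapper are alternatives added here for illustration:

```python
import pandas as pd

df = pd.DataFrame({'a': [-2, 5, 4, -6], 'b': [4, 4, 5, -8]})

def DummyFunction2(row, threshold):
    return row['a'] * row['b'] * threshold

# Extra keyword arguments are forwarded to the function.
by_keyword = df.apply(DummyFunction2, axis=1, threshold=2)

# Positional extras go in the args tuple.
by_args = df.apply(DummyFunction2, axis=1, args=(2,))

# A lambda can also bind the value explicitly.
by_lambda = df.apply(lambda row: DummyFunction2(row, 2), axis=1)

print(by_keyword.tolist())
```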
