Using the apply function to pandas dataframe with arguments - python

I created a function to take a column of a string datatype and ensure the first item in the string is always capitalized. Here is my function:
def myfunc(df, col):
    transformed_df = df[col][0].capitalize() + df[col][1:]
    return transformed_df
Using my function in my column of interest in my pandas dataframe:
df["mycol"].apply(myfunc)
I don't know why I get this error: TypeError: myfunc() missing 1 required positional argument: 'col'
I get the same error even when I add axis to indicate that it should be treated column-wise. I believe I am already passing my arguments, so why do I still need to specify col again? Correct me if I am wrong.
Your input is highly appreciated.

If you use Series.apply, each value of the Series is processed separately, so the function should take a single value:
def myfunc(val):
    return val[0].capitalize() + val[1:]
If you want to use the pandas string functions instead:
df["mycol"].str[0].str.capitalize() + df["mycol"].str[1:]
If you want to wrap that in a function that takes the whole column:
def myfunc(col):
    return col.str[0].str.capitalize() + col.str[1:]
Then use Series.pipe to process the Series:
df["mycol"].pipe(myfunc)
Or call it directly:
myfunc(df["mycol"])
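To make the difference concrete, here is a minimal sketch that runs both approaches on a made-up column. The sample data and the function name capitalize_first are assumptions for illustration only, not part of the original question.

```python
import pandas as pd

# Hypothetical sample data standing in for the asker's column.
df = pd.DataFrame({"mycol": ["hello world", "pandas", "x"]})


def capitalize_first(val):
    # Series.apply calls this once per element, so val is a plain str here.
    return val[0].capitalize() + val[1:]


# Element-wise: the function receives each string in turn.
via_apply = df["mycol"].apply(capitalize_first)

# Vectorized: the .str accessor works on the whole column at once.
via_str = df["mycol"].str[0].str.capitalize() + df["mycol"].str[1:]

print(via_apply.tolist())
print(via_str.tolist())
```

Note that the element-wise version would raise an IndexError on an empty string, because val[0] does not exist in that case.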

Related

Convert a lambda function to a regular function (Python)

I have this current lambda function: df["domain_count"] = df.apply(lambda row : df['domain'].value_counts()[row['domain']], axis = 1)
But I want to convert it to a regular function, something like def get_domain_count(). How do I do this? I'm not sure what parameters it should take, since I want to apply it to an entire column in a dataframe. The domain column contains duplicates, and I want to know how many times each domain appears in my dataframe.
Example starting df:
|domain|
|---|
|target.com|
|macys.com|
|target.com|
|walmart.com|
|walmart.com|
|target.com|
Example resulting df:
|domain|count|
|---|---|
|target.com|3|
|macys.com|1|
|target.com|3|
|walmart.com|2|
|walmart.com|2|
|target.com|3|
Please help! Thanks in advance!
You can pass the column name as a string, and the dataframe object to mutate:
def countify(frame, col_name):
    frame[f"{col_name}_count"] = frame.apply(lambda row: df[col_name]...)
But better yet, you don't need to apply!
df["domain"].map(df["domain"].value_counts())
will first get the counts per unique value, and map each value in the column with that. So the function could become:
def countify(frame, col_name):
    frame[f"{col_name}_count"] = frame[col_name].map(frame[col_name].value_counts())
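As a rough, runnable sketch of the map approach, using the example data from the question: the function name add_count_column is hypothetical, and the groupby line is an alternative that was not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "domain": [
            "target.com",
            "macys.com",
            "target.com",
            "walmart.com",
            "walmart.com",
            "target.com",
        ]
    }
)


def add_count_column(frame, col_name):
    # value_counts() gives a Series indexed by unique value; map() then looks
    # up each row's value in that Series.
    counts = frame[col_name].value_counts()
    frame[f"{col_name}_count"] = frame[col_name].map(counts)
    return frame


add_count_column(df, "domain")

# Alternative (an assumption, not from the answer): group sizes broadcast
# back to every row of the group.
via_groupby = df.groupby("domain")["domain"].transform("size")

print(df)
```

The counts are computed once for the whole column, rather than once per row as in the original lambda.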
A lambda is just an anonymous function, and it's usually easy to turn it into a regular function by using the lambda's own parameter list (in this case, row) and returning its expression. The challenge with this one is the df name, which will resolve differently in a module-level function than in your lambda. So, add that as a parameter to the function:
def get_domain_count(df, row):
    return df['domain'].value_counts()[row['domain']]
This can be a problem if you still want to use this function in an .apply operation, because .apply wouldn't know to add that df parameter at the front. To solve that, you could create a partial:
import functools
def do_stuff(some_df):
    return some_df.apply(functools.partial(get_domain_count, some_df), axis=1)
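Here is a small self-contained sketch of the partial approach. The sample frame is made up for illustration; note that partial lives in the functools module, so it is reached as functools.partial after import functools, and axis=1 is needed so that apply passes rows rather than columns.

```python
import functools

import pandas as pd


def get_domain_count(df, row):
    # df is bound ahead of time by functools.partial; row is supplied by apply.
    return df["domain"].value_counts()[row["domain"]]


# Hypothetical sample data.
df = pd.DataFrame({"domain": ["a.com", "b.com", "a.com"]})

# partial fixes the first positional argument, leaving a one-argument callable
# that DataFrame.apply can call with each row.
count_for_row = functools.partial(get_domain_count, df)
df["domain_count"] = df.apply(count_for_row, axis=1)

print(df)
```

This recomputes value_counts() for every row, so it is mainly useful when the function really has to be used with apply.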

Designing a method to construct a new dataframe from specific columns of another dataframe without variable input parameters

My problem: I want to write a method that combines specific columns of a dataframe into a new dataframe. I don't want to specify in advance how many columns, or which ones, I want to use.
My goal is to join the columns column-wise, like stacking them next to each other.
At the moment my code looks like this:
Attempt 1:
def construct_features(df, *cols):
    features = pd.DataFrame()
    for col in cols:
        features = pd.concat(df[col])
    return features
I also tried using a list:
Attempt 2:
def construct_features(df, *cols):
    features = []
    for col in cols:
        features.append(df[col], axis=1)
    return pd.DataFrame(features)
My function call looks like that:
feature_matrix = construct_features(dataframename, 'colname1', 'colname2', 'colname3')
The first attempt gives me the following error message:
TypeError: first argument must be an iterable of pandas objects, you passed an object of type "DataFrame"
The second attempt gives me this error message:
TypeError: append() takes no keyword arguments
For the second attempt I know that the problem is axis=1. But if I leave it out, the output isn't in the desired shape. It gives me a list of size 3 and I actually have no clue what that means.
Thank you very much in advance!
If I understand correctly what you want, maybe try:
def construct_features(cols, df):
    if isinstance(cols, list):
        return df[cols].copy()
    elif isinstance(cols, str):
        return df[[cols]].copy()
    else:
        print('Cols is not a list or string')
This takes either a list of column names or a single string, along with your dataframe df, and returns a copy of the selected columns, assuming that the names in cols are in df.columns. Is this what you wanted your function to do?
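If you would rather keep the variadic call style from the question, a sketch along these lines might work. The missing-column check and the sample frame are my own assumptions; the pd.concat line shows what the first attempt was probably reaching for, since concat expects a list of pandas objects plus axis=1.

```python
import pandas as pd


def construct_features(df, *cols):
    # cols arrives as a tuple; pandas treats a tuple as a single key, so it
    # has to be turned into a list to select several columns.
    missing = [c for c in cols if c not in df.columns]
    if missing:
        raise KeyError(f"columns not found: {missing}")
    return df[list(cols)].copy()


# Hypothetical sample data.
data = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

feature_matrix = construct_features(data, "a", "c")

# Equivalent result with concat: pass an iterable of Series and axis=1.
concat_version = pd.concat([data[c] for c in ("a", "c")], axis=1)

print(feature_matrix)
```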

Pandas: apply a function with columns and a variable as argument

I'm trying to apply to a dataframe a function that has more than one argument, of which two need to be assigned to the dataframe's rows, and one is a variable (a simple number).
A variation from a similar thread works for the rows: (all functions are oversimplified compared to my original ones)
import pandas as pd
dict={'a':[-2,5,4,-6], 'b':[4,4,5,-8]}
df=pd.DataFrame (dict)
print(df)
def DummyFunction(row):
    return row['a']*row['b']
#this works:
df['Dummy1']=df.apply(DummyFunction, axis=1)
But how can I apply the following variation, where my function takes in an additional argument (a fixed variable)? I seem to find no way to pass it inside the apply method:
def DummyFunction2(row, threshold):
    return row['a']*row['b']*threshold
# where threshold will be assigned to a number?
# I don't seem to find a viable option to fill the row argument below:
# df['Dummy2']=df.apply(DummyFunction2(row,1000), axis=1)
Thanks for your help!
You can pass the additional variable directly as a named argument to pd.DataFrame.apply:
def DummyFunction2(row, threshold):
    return row['a']*row['b']*threshold
df['Dummy2'] = df.apply(DummyFunction2, threshold=2, axis=1)
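For completeness, here is a short sketch of three ways to hand the extra value to the function, using the data from the question. The snake_case function name is just my renaming, and the args and lambda variants are alternatives that the answer above does not mention.

```python
import pandas as pd

df = pd.DataFrame({"a": [-2, 5, 4, -6], "b": [4, 4, 5, -8]})


def dummy_function2(row, threshold):
    return row["a"] * row["b"] * threshold


# 1. Extra keyword arguments to DataFrame.apply are forwarded to the function.
by_keyword = df.apply(dummy_function2, axis=1, threshold=1000)

# 2. Extra positional arguments go in the args tuple.
by_args = df.apply(dummy_function2, axis=1, args=(1000,))

# 3. A lambda can close over the value instead.
by_lambda = df.apply(lambda row: dummy_function2(row, 1000), axis=1)

print(by_keyword.tolist())
```

All three produce the same Series; the keyword form tends to read most clearly when the function has several extra parameters.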

Trouble passing in lambda to apply for pandas DataFrame

I'm trying to apply a function to all rows of a pandas DataFrame (actually just one column in that DataFrame)
I'm sure this is a syntax error, but I'm not sure what I'm doing wrong:
df['col'].apply(lambda x, y:(x - y).total_seconds(), args=[d1], axis=1)
The col column contains a bunch of datetime.datetime objects, and d1 is the earliest of them. I'm trying to get a column with the total number of seconds elapsed since d1 for each row.
EDIT I keep getting the following error
TypeError: <lambda>() got an unexpected keyword argument 'axis'
I don't understand why axis is getting passed to my lambda function
EDIT 2
I've also tried doing
def diff_dates(d1, d2):
    return (d1-d2).total_seconds()
df['col'].apply(diff_dates, args=[d1], axis=1)
And I get the same error
Note that there is no axis parameter for a Series.apply call, as distinct from a DataFrame.apply call.
Series.apply(func, convert_dtype=True, args=(), **kwds)
    func : function
    convert_dtype : boolean, default True
        Try to find better dtype for elementwise function results. If False, leave as dtype=object
    args : tuple
        Positional arguments to pass to function in addition to the value
DataFrame.apply does have an axis parameter, but it's unclear how you expect this to work: you are calling apply on a Series, yet you seem to expect it to operate on rows.
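A minimal sketch of the call without axis, assuming a small made-up column of datetimes. The vectorized line at the end is an alternative that is not in the answer above; it subtracts d1 from the whole column and reads the seconds through the .dt accessor.

```python
import datetime as dt

import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame(
    {
        "col": [
            dt.datetime(2020, 1, 1, 0, 0, 0),
            dt.datetime(2020, 1, 1, 0, 1, 0),
            dt.datetime(2020, 1, 2, 0, 0, 0),
        ]
    }
)
d1 = df["col"].min()


def diff_dates(value, start):
    # value is one element of the Series; start comes from args.
    return (value - start).total_seconds()


# Series.apply takes no axis; extra positional arguments go in args.
elapsed = df["col"].apply(diff_dates, args=(d1,))

# Vectorized alternative: no per-element Python call.
elapsed_vectorized = (df["col"] - d1).dt.total_seconds()

print(elapsed.tolist())
```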

calling apply() on an empty pandas DataFrame

I'm having a problem with the apply() method of the pandas DataFrame. My issue is that apply() can return either a Series or a DataFrame, depending on the return type of the input function; however, when the frame is empty, apply() (almost) always returns a DataFrame. So I can't write code that expects a Series. Here's an example:
import pandas as pd
def area_from_row(row):
    return row['width'] * row['height']
def add_area_column(frame):
    # I know I can multiply the columns directly, but my actual function is
    # more complicated.
    frame['area'] = frame.apply(area_from_row, axis=1)
# This works as expected.
non_empty_frame = pd.DataFrame(data=[[2, 3]], columns=['width', 'height'])
add_area_column(non_empty_frame)
# This fails!
empty_frame = pd.DataFrame(data=None, columns=['width', 'height'])
add_area_column(empty_frame)
Is there a standard way of dealing with this? I can do the following, but it's silly:
def area_from_row(row):
    # The way we respond to an empty row tells pandas whether we're a
    # reduction or not.
    if not len(row):
        return None
    return row['width'] * row['height']
(I'm using pandas 0.11.0, but I checked this on 0.12.0-1100-g0c30665 as well.)
You can set the result_type parameter in apply to 'reduce'.
From the documentation,
By default (result_type=None), the final return type is inferred from the return type of the applied function. Otherwise, it depends on the result_type argument.
And then,
‘reduce’ : returns a Series if possible rather than expanding list-like results. This is the opposite of ‘expand’.
In your code, update here:
def add_area_column(frame):
    # I know I can multiply the columns directly, but my actual function is
    # more complicated.
    frame['area'] = frame.apply(area_from_row, axis=1, result_type='reduce')
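Putting it together, a sketch of the full example might look like this. It assumes a pandas version recent enough to have the result_type parameter, which is newer than the 0.11/0.12 releases mentioned in the question; returning the frame from add_area_column is my own addition for convenience.

```python
import pandas as pd


def area_from_row(row):
    return row["width"] * row["height"]


def add_area_column(frame):
    # result_type="reduce" asks apply for a Series even when the frame is
    # empty and the return type cannot be inferred from any row.
    frame["area"] = frame.apply(area_from_row, axis=1, result_type="reduce")
    return frame


non_empty = add_area_column(pd.DataFrame([[2, 3]], columns=["width", "height"]))
empty = add_area_column(pd.DataFrame(columns=["width", "height"]))

print(non_empty)
print(list(empty.columns), len(empty))
```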
