Trouble passing in lambda to apply for pandas DataFrame - python

I'm trying to apply a function to all rows of a pandas DataFrame (actually just one column in that DataFrame)
I'm sure this is a syntax error, but I'm not sure what I'm doing wrong
df['col'].apply(lambda x, y:(x - y).total_seconds(), args=[d1], axis=1)
The col column contains a bunch of datetime.datetime objects and d1 is the earliest of them. I'm trying to get a column of the total number of seconds for each of the rows
EDIT I keep getting the following error
TypeError: <lambda>() got an unexpected keyword argument 'axis'
I don't understand why axis is getting passed to my lambda function
EDIT 2
I've also tried doing
def diff_dates(d1, d2):
    return (d1-d2).total_seconds()
df['col'].apply(diff_dates, args=[d1], axis=1)
And I get the same error

Note that Series.apply has no axis parameter, unlike DataFrame.apply. Any keyword it does not recognise is forwarded to your function, which is why the lambda receives axis.
Series.apply(func, convert_dtype=True, args=(), **kwds)
func : function
convert_dtype : boolean, default True
Try to find better dtype for elementwise function results. If False, leave as dtype=object
args : tuple
Positional arguments to pass to function in addition to the value
DataFrame.apply does have an axis parameter, but you are calling apply on a Series (a single column), so each call receives one value rather than a row and axis has no meaning there.
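As a minimal sketch of the point above, assuming a small made-up frame with a col column of datetimes, the call works once axis is dropped and args is a tuple. The vectorised subtraction shown next to it is an alternative that avoids apply altogether; the sample data is invented for illustration.

```python
from datetime import datetime

import pandas as pd

# Hypothetical sample data standing in for the asker's DataFrame.
df = pd.DataFrame(
    {
        "col": [
            datetime(2020, 1, 1, 0, 0, 0),
            datetime(2020, 1, 1, 0, 1, 0),
            datetime(2020, 1, 1, 1, 0, 0),
        ]
    }
)
d1 = df["col"].min()  # earliest timestamp in the column

# Series.apply: no axis keyword; extra positional arguments go in args.
secs_apply = df["col"].apply(lambda x, y: (x - y).total_seconds(), args=(d1,))

# Vectorised alternative: subtract once, then use the .dt accessor.
secs_vec = (df["col"] - d1).dt.total_seconds()

print(secs_apply.tolist())
print(secs_vec.tolist())
```

Both expressions produce one float per row; the vectorised form is usually the simpler choice for a whole column.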

Related

How to update pandas DataFrame.drop() for Future Warning - all arguments of DataFrame.drop except for the argument 'labels' will be keyword-only

The following code:
df = df.drop('market', 1)
generates the warning:
FutureWarning: In a future version of pandas all arguments of DataFrame.drop except for the argument 'labels' will be keyword-only
market is the column we want to drop, and we pass the 1 as a second parameter for axis (0 for index, 1 for columns, so we pass 1).
How can we change this line of code now so that it is not a problem in the future version of pandas / to resolve the warning message now?
From the documentation, pandas.DataFrame.drop has the following parameters:
Parameters
labels: single label or list-like Index or column labels to drop.
axis: {0 or ‘index’, 1 or ‘columns’}, default 0 Whether to drop labels from the index (0 or ‘index’) or columns (1 or ‘columns’).
index: single label or list-like Alternative to specifying axis (labels, axis=0 is equivalent to index=labels).
columns: single label or list-like Alternative to specifying axis (labels, axis=1 is equivalent to columns=labels).
level: int or level name, optional For MultiIndex, level from which the labels will be removed.
inplace: bool, default False If False, return a copy. Otherwise, do operation inplace and return None.
errors: {‘ignore’, ‘raise’}, default ‘raise’ If ‘ignore’, suppress error and only existing labels are dropped.
Moving forward, only labels (the first parameter) can be positional.
So, for this example, the drop code should be as follows:
df = df.drop('market', axis=1)
or (more legibly) with columns:
df = df.drop(columns='market')
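A short sketch of the two keyword forms, assuming a made-up frame that has a market column and one other column (price is a hypothetical name used only for illustration):

```python
import pandas as pd

# Hypothetical frame; only the 'market' column name comes from the question.
df = pd.DataFrame({"market": ["a", "b"], "price": [1.0, 2.0]})

# Keyword axis: labels stays positional, everything else is named.
by_axis = df.drop("market", axis=1)

# Equivalent and more explicit: name the columns directly.
by_columns = df.drop(columns="market")

print(list(by_axis.columns))
print(list(by_columns.columns))
```

Neither call modifies df; each returns a new frame without the market column.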
The warning is raised because a future version of pandas will make every argument of drop other than labels keyword-only, so a positional value such as the 1 above will no longer be accepted.
That means axis has to be passed by name, so try:
df.drop('market', axis=1)
As mentioned in the documentation:
**kwargs allows you to pass keyworded variable length of arguments to a function. You should use **kwargs if you want to handle named arguments in a function.
Also, since version 0.21.0 you can specify columns or index directly, like this:
df.drop(columns='market')

Why am I getting a 'hashable' error when combining two dataframes?

I have two DataFrames and I'm attempting to combine them as follows:
df3 = df1.combine(df2, np.mean)
However, I'm getting the following error:
TypeError: 'Series' objects are mutable, thus they cannot be hashed.
I'm not sure I understand why I'm getting the message as by definition DataFrames are mutable?
I don't get an error if I switch to:
df3 = df1.combine(df2, np.minimum)
Is this something to do with me having NaN values in the two DataFrames? If it is then what would be the solution? Devise my own function to replicate np.mean?
Updated:
I just found np.nanmean but that gives the following error:
TypeError: 'Series' object cannot be interpreted as an integer
np.mean takes one positional argument as the input array. So you cannot and should not do
np.mean(series1, series2)
The call above passes series2 as the second argument of np.mean, which is axis. That argument must be an integer, so NumPy tries to convert series2 into one, which triggers the error.
Instead, you should do this for mean:
np.mean([series1, series2])
In the other case, np.minimum is designed to do:
np.minimum(series1, series2)
and gives the minimum element-wise as expected.
TL;DR: for the mean of the two DataFrames, you can just do:
df3 = (df1 + df2) / 2
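To make the difference concrete, here is a small sketch with invented data. It assumes both frames share the same index and columns. Note that plain addition propagates NaN; the concat/groupby variant is one possible NaN-aware alternative, not something the answer above proposes.

```python
import numpy as np
import pandas as pd

# Hypothetical frames with one missing value in df1.
df1 = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, np.nan]})
df2 = pd.DataFrame({"a": [3.0, 4.0], "b": [5.0, 6.0]})

# Element-wise mean by arithmetic: NaN in either frame gives NaN.
mean_simple = (df1 + df2) / 2

# np.mean on a list of two Series, with axis=0, averages element-wise.
mean_a = np.mean([df1["a"], df2["a"]], axis=0)

# NaN-aware alternative: stack the frames and average per index label;
# groupby(...).mean() skips NaN by default.
mean_nan_aware = pd.concat([df1, df2]).groupby(level=0).mean()

print(mean_simple)
print(mean_a)
print(mean_nan_aware)
```

Which variant is appropriate depends on whether a missing value in one frame should make the result missing or should be ignored.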

Using the apply function to pandas dataframe with arguments

I created a function to take a column of a string datatype and ensure the first item in the string is always capitalized. Here is my function:
def myfunc(df, col):
    transformed_df = df[col][0].capitalize() + df[col][1:]
    return transformed_df
Using my function in my column of interest in my pandas dataframe:
df["mycol"].apply(myfunc)
I don't know why I get this error: TypeError: myfunc() missing 1 required positional argument: 'col'
I get it even when I add axis to indicate that it should be treated column-wise. I believe I am already passing my arguments, so why do I still need to specify col again? Correct me if I am wrong.
Your input is highly appreciated
If you use Series.apply, each value of the Series is processed separately, so the function needs to take a single value:
def myfunc(val):
    return val[0].capitalize() + val[1:]
If you want to use the pandas string methods:
df["mycol"].str[0].str.capitalize() + df["mycol"].str[1:]
If you want to pass the whole Series to a function:
def myfunc(col):
    return col.str[0].str.capitalize() + col.str[1:]
Then use Series.pipe for processing Series:
df["mycol"].pipe(myfunc)
Or:
myfunc(df["mycol"])
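The approaches above can be compared side by side in a short sketch. The sample strings are invented, and the function names capitalize_value and capitalize_series are hypothetical names chosen here to keep the two variants apart.

```python
import pandas as pd

# Hypothetical sample column of non-empty strings.
df = pd.DataFrame({"mycol": ["apple pie", "banana", "cHERRY"]})


def capitalize_value(val):
    # Called once per element by Series.apply; val is a single string.
    return val[0].capitalize() + val[1:]


def capitalize_series(col):
    # Receives the whole Series; uses the vectorised .str accessor.
    return col.str[0].str.capitalize() + col.str[1:]


via_apply = df["mycol"].apply(capitalize_value)
via_pipe = df["mycol"].pipe(capitalize_series)

print(via_apply.tolist())
print(via_pipe.tolist())
```

Both variants leave the rest of each string untouched, which differs from calling capitalize on the full string (that would lower-case the remaining characters).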

Designing a method to construct a new dataframe from specific columns of another dataframe without variable input parameters

My problem: I want to construct a method which connects specific columns of a dataframe and by that builds a new dataframe. I don't want to specify at the beginning how many and which columns I want to use.
My goal is to connect the columns by column, like stacking them next to each other.
At the moment my code looks like this:
Attempt 1:
def construct_features(df, *cols):
    features = pd.DataFrame()
    for col in cols:
        features = pd.concat(df[col])
    return features
I also tried using a list:
Attempt 2:
def construct_features(df, *cols):
    features = []
    for col in cols:
        features.append(df[col], axis=1)
    return pd.DataFrame(features)
My function call looks like that:
feature_matrix = construct_features(dataframename, 'colname1', 'colname2', 'colname3')
The first attempt gives me the following error message:
TypeError: first argument must be an iterable of pandas objects, you passed an object of type "DataFrame"
The second attempt gives me this error message:
TypeError: append() takes no keyword arguments
For the second attempt I know that the problem is axis=1. But if I leave it out, the output isn't in the desired shape. It gives me a list of size 3 and I actually have no clue what that means.
Thank you very much in advance!
If I understand what you want correctly maybe try
def construct_features(cols, df):
    if isinstance(cols, list):
        return df[cols].copy()
    elif isinstance(cols, str):
        return df[[cols]].copy()
    else:
        print('Cols is not a list or string')
This takes either a list of column names or a single string, together with your dataframe df, and returns a copy of just those columns of the original dataframe, on the assumption that every name in cols is in df.columns. Is this what you wanted your function to do?
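If the original call style with a variable number of column names is preferred, one possible sketch is below. It keeps the asker's signature and fixes the first attempt by giving pd.concat a list of Series and axis=1; the sample frame is made up, and the function raises if no column names are passed because pd.concat rejects an empty list.

```python
import pandas as pd


def construct_features(df, *cols):
    # pd.concat expects an iterable of pandas objects;
    # axis=1 places the selected columns next to each other.
    return pd.concat([df[col] for col in cols], axis=1)


# Hypothetical data for illustration.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})
features = construct_features(df, "a", "c")

print(features)
```

For plain column selection, df[list(cols)].copy() would give the same frame with less work; the concat form is shown only because it mirrors the first attempt.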

sort_values() got an unexpected keyword argument 'by'

for i in str_list:  # str_list is a set containing some strings
    df.loc[i].sort_values(by = 'XXX')
TypeError: sort_values() got an unexpected keyword argument 'by'
>>> type(df.loc[i])
>>> pandas.core.frame.DataFrame
But it works outside the for loop!
df.loc['string'].sort_values(by = 'XXX')
>>> type(df.loc['string'])
>>> pandas.core.frame.DataFrame
I'm confused.
This is because the result of the loc operator is a pandas.Series object in your case. The sort_values in this case doesn't have a keyword argument by because it can only sort the series values. Have a look at the difference in the signature when you call sort values in a pandas.DataFrame
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html
and when you call sort_values in a pandas.Series
http://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.Series.sort_values.html
To add to the answer,
why is it returning a series in one case and a data frame in another?
.loc returns a Series in the first case
for i in str_list:  # str_list is a set containing some strings
    df.loc[i].sort_values(by = 'XXX')
because the label i appears only once in the DataFrame's index.
In the second case, the label 'string' is duplicated in the index, so .loc returns a DataFrame.
df.loc['string'].sort_values(by = 'XXX')
If the label 'string' is not duplicated, note that the result also differs depending on whether the argument to .loc is a single label or a list.
For example:
df.loc['string'] -> returns a Series
df.loc[['string']] -> returns a DataFrame
Maybe in the second case you are giving ['string'] as the argument instead of 'string' ?
Hope this helps.
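The behaviour described in both answers can be checked with a small sketch. The frame, its labels and the YYY column are invented for illustration; only the XXX column name comes from the question. Wrapping the label in a list is one way to always get a DataFrame back, so that sort_values(by=...) works inside the loop.

```python
import pandas as pd

# Hypothetical frame: 'dup' appears twice in the index, 'unique' once.
df = pd.DataFrame(
    {"XXX": [3, 1, 2], "YYY": [9, 8, 7]},
    index=["dup", "dup", "unique"],
)

single_unique = df.loc["unique"]    # one matching row -> Series
single_dup = df.loc["dup"]          # several matching rows -> DataFrame
listed_unique = df.loc[["unique"]]  # list of labels -> always DataFrame

# Using a list keeps the type stable, so by= is accepted for every label.
sorted_parts = {
    label: df.loc[[label]].sort_values(by="XXX")
    for label in sorted(set(df.index))
}

print(type(single_unique).__name__)
print(type(single_dup).__name__)
print(type(listed_unique).__name__)
print(sorted_parts["dup"])
```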
