sort_values() got an unexpected keyword argument 'by' - python

for i in str_list:  # str_list is a set containing some strings
    df.loc[i].sort_values(by='XXX')
TypeError: sort_values() got an unexpected keyword argument 'by'
>>> type(df.loc[i])
pandas.core.frame.DataFrame
But it works outside the for loop!
df.loc['string'].sort_values(by = 'XXX')
>>> type(df.loc['string'])
pandas.core.frame.DataFrame
I'm confused.

This happens because, in your case, the result of the loc operator is a pandas.Series object, not a DataFrame. Series.sort_values has no by keyword argument because a Series has only one set of values to sort. Compare the signature of sort_values on a pandas.DataFrame
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html
with the signature of sort_values on a pandas.Series
http://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.Series.sort_values.html

To add to the answer: why does .loc return a Series in one case and a DataFrame in the other?
.loc returns a Series in the first case
for i in str_list:  # str_list is a set containing some strings
    df.loc[i].sort_values(by='XXX')
because the label i appears only once in the DataFrame's index.
In the second case, the label 'string' is duplicated in the index, so .loc returns a DataFrame.
df.loc['string'].sort_values(by='XXX')
If the label 'string' is not duplicated, note that the result also depends on whether the argument to .loc is a scalar or a list.
For example:
df.loc['string'] -> returns a Series
df.loc[['string']] -> returns a DataFrame
Maybe in the second case you are passing ['string'] as the argument instead of 'string'?
Hope this helps.
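The behaviour described in the two answers above can be reproduced with a small frame. This is only a sketch: the frame, its index labels ("dup", "uniq") and the column "YYY" are made up for illustration and are not the asker's data.

```python
import pandas as pd

# Hypothetical frame: the label "dup" appears twice in the index, "uniq" once.
df = pd.DataFrame(
    {"XXX": [3, 1, 2], "YYY": [10, 20, 30]},
    index=["dup", "dup", "uniq"],
)

unique_row = df.loc["uniq"]    # scalar label with one match -> Series
dup_rows = df.loc["dup"]       # scalar label with two matches -> DataFrame
listed_row = df.loc[["uniq"]]  # list of labels -> DataFrame, even for one label

# DataFrame.sort_values needs `by`; Series.sort_values does not accept it.
sorted_dup = dup_rows.sort_values(by="XXX")
sorted_uniq = unique_row.sort_values()

print(type(unique_row).__name__, type(dup_rows).__name__, type(listed_row).__name__)
```

If the loop has to call sort_values(by=...) on every label, indexing with a list (df.loc[[i]]) keeps the result a DataFrame regardless of how many rows match.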

Related

How do I *avoid* passing arguments to function in Python pandas .apply function?

I want to experiment with the raw=True option in the pandas apply function, as per p. 155 in High Performance Python, by Gorelick and Ozsvald. However, Python apparently treats raw=True as an argument for the function I'm applying, and not for the .apply function itself! Here's a MWE:
import pandas as pd
df = pd.DataFrame(columns=('a', 'b'))
df.loc[0] = (1, 2)
df.loc[1] = (3, 4)
df['a'] = df['a'].apply(str, raw=True)
When I try to execute this, I get the following error:
TypeError: 'raw' is an invalid keyword argument for str()
The problem stays there even if I use a lambda expression:
df['a'] = df['a'].apply(lambda x: str(x), raw=True)
The problem remains if I call a custom-defined function instead of str.
How do I get Pandas to recognize that raw=True is an argument for .apply and NOT str?
Referring to the comments, I don't think these are side effects. Series.apply has no raw parameter, so it forwards raw=True to the applied function as a keyword argument, which is why str() complains. DataFrame.apply does accept raw. As the documentation states, when you pass raw=True the "function receive ndarray objects", so you pass a whole array and convert it to a string. The result is a string like [1 3]. So you don't convert each value to a string; you convert the whole column to a single string.
If you write a little helper function, you can see that.
def conv(col):
    print(f"input values: {col}")
    print(f"type input: {type(col)}\n")
    return str(col)
t = df[['a']].apply(conv, raw=True)
print(f"{type(t)}:\n{t}\n")
# use .iloc for positional access; t is indexed by the column label 'a'
print(f"first value: {type(t.iloc[0])}:\n{t.iloc[0]}\n")
print(f"{t.iloc[0][0]}")
Output:
input values: [1 3]
type input: <class 'numpy.ndarray'>
<class 'pandas.core.series.Series'>:
a [1 3]
dtype: object
first value: <class 'str'>:
[1 3]
[
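The difference between the two apply methods can be checked directly. The following is a minimal sketch, assuming a small integer frame rather than the asker's exact one; the helper name to_strings and the list seen_types are invented for the illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 3], "b": [2, 4]})

# Series.apply has no `raw` parameter, so the keyword is forwarded to str().
try:
    df["a"].apply(str, raw=True)
    forwarded_to_func = False
except TypeError:
    forwarded_to_func = True

# DataFrame.apply accepts raw=True; each column then arrives as a NumPy array.
seen_types = []

def to_strings(col):
    # Hypothetical helper: record the input type, convert element-wise.
    seen_types.append(type(col))
    return col.astype(str)

as_text = df[["a"]].apply(to_strings, raw=True)

# For a plain per-element conversion, astype avoids apply altogether.
direct = df["a"].astype(str)

print(forwarded_to_func, seen_types[0].__name__, as_text["a"].tolist())
```

Returning an array of the same length (instead of str(col)) keeps one string per value, which is probably what the question was after.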

Dictionary like get() on pandas Series with index value like iloc

Suppose I have the following pandas Series:
series=pd.Series(data=['A', 'B'], index=[2,3])
I can get the first value using .iloc as series.iloc[0] #output: 'A'
Or, I can just use the get method passing the actual index value like series.get(2) #output: 'A'
But is there any way to pass an iloc-like position to the get method to get the first value in the series?
>>> series.get(0, '')
'' # Expected 'A', the first value for index 0
One way is to just reset the index and drop it then to call get method:
>>> series.reset_index(drop=True).get(0, '')
'A'
Is there any other function similar to the get method (or a way to use the get method) that accepts an iloc-like position, without the overhead of resetting the index?
EDIT: Please note that this series comes from DataFrame masking, so it can be empty. That is why I intentionally passed the empty string '' as the default value to the get method.
Use next with iter if you need the first value and the Series may be empty:
next(iter(series), None)
Is this what you have in mind or not really?
series.get(series.index[0])
What about using head to get the first row and take the value from there?
series.head(1).item() # A
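Since the question says the series may be empty, it is worth noting that series.get(series.index[0]) and series.head(1).item() both raise on an empty Series, while next(iter(series), default) does not. A small wrapper around .iloc gives get-like behaviour for any position. This is a sketch only; get_by_position is a hypothetical helper, not a pandas function.

```python
import pandas as pd

def get_by_position(series, pos, default=None):
    """Hypothetical helper: positional lookup with a dict.get-style default."""
    try:
        return series.iloc[pos]
    except IndexError:
        # .iloc raises IndexError when the position is out of bounds
        return default

series = pd.Series(data=["A", "B"], index=[2, 3])
empty = series[series == "Z"]  # a mask that matches nothing

first = get_by_position(series, 0, "")
missing = get_by_position(empty, 0, "")
first_via_iter = next(iter(empty), "")

print(repr(first), repr(missing), repr(first_via_iter))
```

Neither approach copies or resets the index, so the overhead mentioned in the question is avoided.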

Why am I getting a 'hashable' error when combining two dataframes?

I have two DataFrames and I'm attempting to combine them as follows:
df3 = df1.combine(df2, np.mean)
However, I'm getting the following error:
TypeError: 'Series' objects are mutable, thus they cannot be hashed.
I'm not sure I understand why I'm getting the message as by definition DataFrames are mutable?
I don't get an error if I switch to:
df3 = df1.combine(df2, np.minimum)
Is this something to do with me having NaN values in the two DataFrames? If it is then what would be the solution? Devise my own function to replicate np.mean?
Updated:
I just found np.nanmean but that gives the following error:
TypeError: 'Series' object cannot be interpreted as an integer
np.mean takes a single positional argument as the input array, so you cannot and should not do
np.mean(series1, series2)
The command above interprets series2 as the second argument of np.mean, which is axis. That argument must be an integer, so Python tries to convert series2 into one, which triggers the error.
Instead, you should do this for the mean:
np.mean([series1, series2], axis=0)
Without axis=0, np.mean returns a single scalar over all values rather than an element-wise mean.
In the other case, np.minimum is designed to be called as
np.minimum(series1, series2)
and gives the element-wise minimum, as expected.
TL;DR: for the mean, you can just do:
df3 = (df1 + df2) / 2
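Because the question mentions NaN values, here is a sketch of two functions that could be passed to combine: one that propagates NaN (equivalent to the arithmetic shortcut) and one that skips it. The frames and the names pair_mean and pair_nanmean are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in df1.
df1 = pd.DataFrame({"x": [1.0, 2.0], "y": [3.0, np.nan]})
df2 = pd.DataFrame({"x": [3.0, 4.0], "y": [5.0, 6.0]})

def pair_mean(s1, s2):
    # Element-wise mean; NaN in either input gives NaN in the result.
    return (s1 + s2) / 2

def pair_nanmean(s1, s2):
    # Element-wise mean that ignores NaN (Series.mean skips NaN by default).
    return pd.concat([s1, s2], axis=1).mean(axis=1)

strict = df1.combine(df2, pair_mean)
lenient = df1.combine(df2, pair_nanmean)

print(strict)
print(lenient)
```

Both functions take two Series and return one Series, which is the shape of callable that DataFrame.combine expects.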

Using the apply function to pandas dataframe with arguments

I created a function to take a column of a string datatype and ensure the first item in the string is always capitalized. Here is my function:
def myfunc(df, col):
    transformed_df = df[col][0].capitalize() + df[col][1:]
    return transformed_df
Using my function in my column of interest in my pandas dataframe:
df["mycol"].apply(myfunc)
I don't know why I get this error: TypeError: myfunc() missing 1 required positional argument: 'col'
I get the same error even when I add axis to indicate that the function should be applied column-wise. I believe I am already passing my arguments, so why do I still need to specify col? Correct me if I am wrong.
Your input is highly appreciated
With Series.apply, each value of the Series is processed separately, so the function should take a single value:
def myfunc(val):
    return val[0].capitalize() + val[1:]
If you want to use the pandas string functions instead:
df["mycol"].str[0].str.capitalize() + df["mycol"].str[1:]
If you want to pass the whole column to a function:
def myfunc(col):
    return col.str[0].str.capitalize() + col.str[1:]
Then use Series.pipe to process the Series:
df["mycol"].pipe(myfunc)
Or:
myfunc(df["mycol"])

Trouble passing in lambda to apply for pandas DataFrame

I'm trying to apply a function to all rows of a pandas DataFrame (actually just one column in that DataFrame).
I'm sure this is a syntax error, but I'm not sure what I'm doing wrong:
df['col'].apply(lambda x, y:(x - y).total_seconds(), args=[d1], axis=1)
The col column contains a bunch of datetime.datetime objects, and d1 is the earliest of them. I'm trying to get a column with the total number of seconds for each of the rows.
EDIT I keep getting the following error
TypeError: <lambda>() got an unexpected keyword argument 'axis'
I don't understand why axis is getting passed to my lambda function
EDIT 2
I've also tried doing
def diff_dates(d1, d2):
    return (d1-d2).total_seconds()
df['col'].apply(diff_dates, args=[d1], axis=1)
And I get the same error
Note that Series.apply has no axis parameter, unlike DataFrame.apply. Any keyword it does not recognize is passed on to the function, which is why axis reaches your lambda.
Series.apply(func, convert_dtype=True, args=(), **kwds)
func : function
convert_dtype : boolean, default True
    Try to find better dtype for elementwise function results. If False, leave as dtype=object
args : tuple
    Positional arguments to pass to function in addition to the value
DataFrame.apply does have an axis parameter, but it's unclear how you expect this to work: you're calling apply on a Series, yet you seem to expect it to operate on rows.
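Dropping axis=1 is enough to make the original call work. A minimal sketch, assuming three made-up timestamps in place of the asker's data, with a vectorized alternative that avoids apply entirely:

```python
import datetime as dt
import pandas as pd

# Hypothetical column of datetimes; d1 is the earliest of them.
df = pd.DataFrame({
    "col": [
        dt.datetime(2020, 1, 1, 0, 0, 0),
        dt.datetime(2020, 1, 1, 0, 1, 0),
        dt.datetime(2020, 1, 1, 1, 0, 0),
    ]
})
d1 = df["col"].min()

# Series.apply: extra positional arguments go through `args`, and no axis is given.
via_apply = df["col"].apply(lambda x, y: (x - y).total_seconds(), args=(d1,))

# Vectorized: subtracting a timestamp yields timedeltas, exposed through .dt
vectorized = (df["col"] - d1).dt.total_seconds()

print(via_apply.tolist())
```

The vectorized form is usually preferable on large columns because it does not call a Python function once per element.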
