I've been using the pandas apply method for both Series and DataFrames, but I am obviously still missing something, because I'm stumped on a simple function I'm trying to execute.
This is what I was doing:
def minmax(row):
    return (row - row.min()) / (row.max() - row.min())

row.apply(minmax)
But this returns an all-zero Series. For example, if
row = pd.Series([0, 1, 2])
then
minmax(row)
returns [0.0, 0.5, 1.0], as desired. But, row.apply(minmax) returns [0,0,0].
I believe this is because the series is of ints and the integer division returns 0. However, I don't understand,
why it works with minmax(row) (shouldn't it act the same?), and
how to cast it correctly in the apply function to return appropriate float values (I've tried casting with .astype and that gives me all NaNs... which I also don't understand)
If I apply this to a DataFrame, as df.apply(minmax), it also works as desired. (edit added)
I suspect I'm missing something fundamental in how apply works... or I'm being dense. Either way, thanks in advance.
When you call row.apply(minmax) on a Series, only the individual values are passed to the function, one at a time. This is element-wise application. From the Series.apply docs:
Invoke function on values of Series. Can be ufunc (a NumPy function that applies to the entire Series) or a Python function that only works on single values.
When you call apply on a DataFrame (df.apply(minmax) in your example), whole columns (the default, axis=0) or whole rows (axis=1) are passed to the function, according to the value of axis. From the DataFrame.apply docs:
Objects passed to the function are Series objects whose index is either the DataFrame's index (axis=0) or the columns (axis=1). Return type depends on whether the passed function aggregates, or the reduce argument if the DataFrame is empty. This is called column-wise or row-wise application.
This is why your example works as expected on the DataFrame and not on the Series. Check this answer for information on mapping functions to Series.
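To make the difference concrete, here is a minimal sketch with toy data (not the asker's) showing what each flavour of apply hands to the function:

import pandas as pd

def minmax(x):
    return (x - x.min()) / (x.max() - x.min())

row = pd.Series([0, 1, 2])
df = pd.DataFrame({"a": [0, 1, 2], "b": [10, 20, 30]})

# Series.apply hands minmax one value at a time, so inside the function
# x.min() == x.max() == that single value, the denominator is zero, and
# the result degenerates for every element.
print(row.apply(type))    # scalar types, not Series

# DataFrame.apply hands minmax whole columns as Series, so it works as intended.
print(df.apply(minmax))

# For a single Series there is no need for apply at all -- call the function directly:
print(minmax(row))        # 0.0, 0.5, 1.0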
I am trying to modify a pandas dataframe column this way:
Temporary=DF.loc[start:end].copy()
SLICE=Temporary.unstack("time").copy()
SLICE["Var"]["Jan"] = 2678400*SLICE["Var"]["Jan"]
However, this does not work. The resulting column SLICE["Var"]["Jan"] is still the same as before the multiplication.
If I multiply with 2 orders of magnitude less, the multiplication works. Also a subsequent multiplication with 100 to receive the same value that was intended in the first place, works.
SLICE["Var"]["Jan"] = 26784*SLICE["Var"]["Jan"]
SLICE["Var"]["Jan"] = 100*SLICE["Var"]["Jan"]
It seems like the scalar is too large for the multiplication. Is this a Python thing or a pandas thing? How can I make sure that the multiplication with the 7-digit number works directly?
I am using Python 3.8. The numbers in the DataFrame are float32, in a range between -5.0e-5 and 5.0e-5, with some values having an absolute value smaller than 1e-11.
EDIT: It might have to do with the 2-level column indexing. When I delete the first level, the calculation works:
Temporary=DF.loc[start:end].copy()
SLICE=Temporary.unstack("time").copy()
SLICE=SLICE.droplevel(0, axis=1)
SLICE["Jan"] = 2678400*SLICE["Jan"]
Your first method likely triggers a SettingWithCopyWarning, which basically means the changes may not be made to the actual DataFrame: the chained indexing SLICE["Var"]["Jan"] = ... can assign to a temporary copy instead. You can use a single .loc call with the full column tuple:
SLICE.loc[:,('Var', 'Jan')] = SLICE.loc[:,('Var', 'Jan')]*2678400
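As a small self-contained sketch (with made-up data shaped like the question's, not the actual DF), the difference looks like this:

import numpy as np
import pandas as pd

cols = pd.MultiIndex.from_product([["Var"], ["Jan", "Feb"]])
SLICE = pd.DataFrame(np.full((3, 2), 1e-7, dtype="float32"), columns=cols)

# Chained indexing may write to a temporary copy and trigger SettingWithCopyWarning:
# SLICE["Var"]["Jan"] = 2678400 * SLICE["Var"]["Jan"]

# A single .loc call with a column tuple writes back to SLICE itself:
SLICE.loc[:, ("Var", "Jan")] = SLICE.loc[:, ("Var", "Jan")] * 2678400
print(SLICE)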
I'm learning Python and want to use the "apply" function. Reading around the manual, I found that if I have a simple DataFrame like this:
df = pd.DataFrame([[4, 9]] * 3, columns=['A', 'B'])
   A  B
0  4  9
1  4  9
2  4  9
and then I use something like this:
df.apply(lambda x:x.sum(),axis=0)
the output works because x receives each column in turn and the sum is applied to each, so the result is correctly this:
A 12
B 27
dtype: int64
When instead I issue something like:
df['A'].apply(lambda x:x.sum())
result is: 'int' object has no attribute 'sum'
The question is: why does this work on a DataFrame, column by column, but not on a single column? In the end the logic should be the same; x should receive one column as input instead of two.
I know that for this simple example I should use other functions like df.agg or even df['A'].sum() but the question is to understand the logic of apply.
If you look at a specific column of a pandas.DataFrame object, you are working with a pandas.Series with (in your case) integers as values, and Series.apply passes those values to your lambda one at a time. Well, integers don't have a sum() method.
(Run type(df['A']) to see that you are working with a series and not a data frame anymore when slicing a single column).
The irritating part is that if you apply over an actual pandas.DataFrame object, each column passed to your function is a pandas.Series object, and those do have a sum() method.
So there are two ways to fix your problem:
Work with a pandas.DataFrame instead of a pandas.Series: df[['A']]. The additional brackets force pandas to return a pandas.DataFrame object (verify with type(df[['A']])), and then use the lambda function just as you did before.
Use a function rather than a method in the lambda: df['A'].apply(lambda x: np.sum(x)) (assuming you have imported numpy as np).
I would recommend going with the second option, as it seems to me the more generic and clearer way.
However, this is only relevant if you want to apply a certain function to every element in a pandas.Series or pandas.DataFrame. In your specific case, there is no need to take the detour you are currently using. Just use df.sum(axis=0).
The approach with apply is overcomplicating things. The reason it works is that every column of a pandas.DataFrame is a pandas.Series, which has a sum() method. But so does a pandas.DataFrame itself, so you can use that right away.
The only case where you actually need to go the apply route is if you had arrays in every cell of the pandas.DataFrame.
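As a short sketch of the options above (my own restatement, using the same toy frame from the question):

import numpy as np
import pandas as pd

df = pd.DataFrame([[4, 9]] * 3, columns=['A', 'B'])

# Option 1: double brackets keep a DataFrame, so apply receives column Series
print(df[['A']].apply(lambda x: x.sum(), axis=0))   # A    12

# Option 2: np.sum also accepts scalars, so Series.apply no longer errors
# (each element is simply returned unchanged here)
print(df['A'].apply(lambda x: np.sum(x)))

# The direct routes, without apply:
print(df.sum(axis=0))    # A 12, B 27
print(df['A'].sum())     # 12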
I am wondering why pandas assign function cannot handle returned lists.
For example
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "val": [10, 20, 30, 30, 40]
})
def squareMe(x):
    return x**2
df = df.assign(val2 = lambda x: squareMe(x.val))
# Out > Works fine : Returns a DataFrame with squared values
But if we return a list,
def squareMe(x):
    return [x**2]
df = df.assign(val2 = lambda x: squareMe(x.val))
#Out > ValueError: Length of values (1) does not match length of index (5)
However pandas apply function works fine when returning a list
def squareMe(x):
    return [x**2]
df["val2"] = df.val.apply(lambda x: squareMe(x))
Any particular reason why this is or am I doing something wrong?
Since you reference x.val in the call to squareMe, that function is passed a list (you can easily verify this by adding a debug statement to print type(x) inside the function).
Thus, x ** 2 returns a Series (since the expression is vectorized) and the assignment works correctly.
But when you return [x ** 2] you're returning the Series inside a list, which doesn't make sense to assign, since all pandas sees is an iterable of size 1 (the Series inside it), and it deems this to be the incorrect length for a column assignment to a DataFrame of length 5 (which is exactly what ValueError: Length of values (1) does not match length of index (5) means).
The difference with apply is that the function receives a single number, not a Series. So you still return a single item (a list) for each value, which apply accepts, but it is still technically wrong since you shouldn't need to wrap the result in a list.
More information: df.assign, df.apply
P.S.: you probably already understand this, but you can simplify this to df['val2'] = df['val'] ** 2
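A quick way to check what squareMe actually receives under assign, using the toy frame from the question (the extra val3 column is just for illustration):

import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3, 4, 5], "val": [10, 20, 30, 30, 40]})

def squareMe(x):
    print(type(x))        # inspect what arrives inside the function
    return x ** 2

df = df.assign(val2=lambda x: squareMe(x.val))   # the whole 'val' column arrives at once

# The vectorized shortcut from the P.S.:
df["val3"] = df["val"] ** 2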
assign isn't particularly meant for this; it is for assigning columns from callables or from already-computed sequences passed as keyword arguments.
Docs:
Parameters: **kwargs : dict of {str: callable or Series}
The column names are keywords. If the values are callable, they are computed on the DataFrame and assigned to the new columns. The callable must not change the input DataFrame (though pandas doesn't check it). If the values are not callable (e.g. a Series, scalar, or array), they are simply assigned.
Doing [x ** 2] returns the squared Series wrapped in a single-element list, and therefore, as the error mentions:
ValueError: Length of values (1) does not match length of index (5)
the length of the values doesn't match the length of the index.
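Per the quoted docs, assign is happiest when it is given a callable that returns a Series, or an already-computed sequence of the right length; a rough sketch:

import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3, 4, 5], "val": [10, 20, 30, 30, 40]})

df = df.assign(val2=df["val"] ** 2)                 # a Series, aligned by index
df = df.assign(val3=[100, 400, 900, 900, 1600])     # a plain list of matching length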
This has been driving me nuts all day in my own work, but I've got it now.
cs95 is almost correct, but not quite. If you follow their advice and put a print(f"{type(x)}") in your squareMe function you'll see that it's a Series, not a list.
That's the catch, x.val is always a Series (the entire column of values), and squareMe returns a Series.
In contrast, apply iterates: calling it on the val Series passes each individual value of val to squareMe, building a new Series for your new column in the process (and df.apply with axis=1 does the same thing row by row).
The reason it confused you (and me!) is that, when it works in your first example, it looks like squareMe is operating on integers and returning an integer for each row. But in fact, it's taking advantage of operator overloading to square the Series, not individual values: It's using the pow function, which is aliased as **, which like the other overloaded operators on Series, works element-wise.
Now, when you change squareMe to return the list of the result: [x**2], it's again squaring the entire Series to get a new Series of squares, but then making a list of that Series. That is, a list of a single element, the element being a Series.
Now assign was expecting a Series back from squareMe of the same length as the index of the dataframe, which is 5, and you returned it a list with a single element - hence the error: expected length 5, got 1.
Your apply, in the meantime, is working on the Series val because that's what you called it on, and it's iterating over the values in that series. Another way to do the apply, which is closer to your assign is this:
df["val2"] = df.apply(lambda x: squareMe(x.val), axis=1)
Pandas loc method when used with row filter throws an error
test[test.loc[0:1,['holiday','weekday']].apply(lambda x:True,axis=1)]
IndexingError: Unalignable boolean Series provided as indexer (index
of the boolean Series and of the indexed object do not match).
whereas the same code without row filter works fine
test[test.loc[:,['holiday','weekday']].apply(lambda x:True,axis=1)]
steps to reproduce
test=pd.DataFrame({"holiday":[0,0,0],"weekday":[1,2,3],"workingday":[1,1,1]})
test[test.loc[:,['holiday','weekday']].apply(lambda x:True,axis=1)] ##works fine
test[test.loc[0:1,['holiday','weekday']].apply(lambda x:True,axis=1)] ##fails
I am trying to understand what the difference between these two is that makes one fail while the other succeeds.
So the basic syntax is DataFrame[things to look for, e.g row slices or columns]
With that in mind, you are trying to filter your dataframe test with the following commands (these are the code snippets in the brackets):
test.loc[:,['holiday','weekday']].apply(lambda x:True,axis=1)
This returns True for every row in the dataframe and therefore the "filter" returns the entire dataframe
test.loc[0:1,['holiday','weekday']].apply(lambda x:True,axis=1)
This part itself works: it slices rows 0 and 1 and then applies the lambda function. Therefore, the "filter" contains True for only 2 rows. Now the point is that there is no value for the third row, and this causes your error: the index of the DataFrame being sliced (3 rows) and the index of the boolean Series used to slice it (2 values) don't match.
Solving this problem depends on what you actually want as your output, i.e. whether the lambda function is supposed to be applied only to a subset of the data, or whether you only want to retrieve a subset of the results to work with.
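If the goal is to apply the lambda only to rows 0 and 1 but still filter the full frame, one option (my suggestion, not part of the original answer) is to reindex the short mask back to the full index, filling the missing rows with False:

import pandas as pd

test = pd.DataFrame({"holiday": [0, 0, 0], "weekday": [1, 2, 3], "workingday": [1, 1, 1]})

mask = test.loc[0:1, ['holiday', 'weekday']].apply(lambda x: True, axis=1)
mask = mask.reindex(test.index, fill_value=False)   # align the 2-row mask to all 3 rows
print(test[mask])                                   # keeps only rows 0 and 1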
I want to set a cell of pandas dataframe equal to another. For example:
station_dim.loc[station_dim.nlc==573,'longitude']=station_dim.loc[station_dim.nlc==5152,'longitude']
However, when I checked
station_dim.loc[station_dim.nlc==573,'longitude']
It returns NaN
Besides directly setting station_dim.loc[station_dim.nlc==573,'longitude'] to a number, what other choice do I have? And why can't I use this method?
Take a look at get_value (deprecated in newer pandas versions in favor of .at), or use .values:
station_dim.loc[station_dim.nlc==573,'longitude']=station_dim.loc[station_dim.nlc==5152,'longitude'].values[0]
For the assignment to work: .loc[] returns a pd.Series, and the index of that pd.Series would need to align with your df, which it probably doesn't. So either extract the value directly using .get_value() (where you need to get the index label first), or use .values, which returns a np.array, and take the first value of that array.
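A tiny sketch of the alignment issue and the .values fix, with made-up data standing in for station_dim:

import pandas as pd

station_dim = pd.DataFrame({"nlc": [573, 5152], "longitude": [float("nan"), -0.12]})

# Direct .loc-to-.loc assignment aligns on the row index: the source row has a
# different index label than the target row, so the target stays NaN.
station_dim.loc[station_dim.nlc == 573, 'longitude'] = \
    station_dim.loc[station_dim.nlc == 5152, 'longitude']
print(station_dim.loc[station_dim.nlc == 573, 'longitude'])   # still NaN

# Extract the raw value first, then assign it:
station_dim.loc[station_dim.nlc == 573, 'longitude'] = \
    station_dim.loc[station_dim.nlc == 5152, 'longitude'].values[0]
print(station_dim.loc[station_dim.nlc == 573, 'longitude'])   # -0.12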