Let us assume we are given the below function:
def f(x, y):
    y = x + y
    return y
The function f(x,y) sums two numbers (but it could be any more or less complicated function of two arguments). Let us now consider the following:
import pandas as pd
import random
import numpy as np
random.seed(1234)
df = pd.DataFrame({'first': random.sample(range(0, 9), 5),
                   'second': np.nan})
y = 1
df
first second
0 7 NaN
1 1 NaN
2 0 NaN
3 6 NaN
4 4 NaN
For the scope of the question, the second column of the data frame is irrelevant, so we can assume without loss of generality that it is NaN. Let us apply f(x,y) to each row of the data frame, given that the variable y has been initialised to 1. The first iteration returns 7 + 1 = 8; when applying the function to the second row, we want y to have been updated to the previously calculated 8, so that the result is 1 + 8 = 9, and so on.
What is the pythonic way to handle this? I want to avoid looping and re-assigning the variables inside the loop, thus my guess would be something along the lines of
def apply_to_df(df, y):
    result = df['first'].apply(lambda s: f(s, y))
    return result
However, one can easily see that the above does not use the updated values: it performs all the calculations with the original value y = 1.
print(apply_to_df(df,y))
0 8
1 2
2 1
3 7
4 5
Note, you can probably solve this specific case with an existing cumulative function (see the sketch after the example below). However, in the general case, you could just hack it by relying on global state:
In [7]: y = 1

In [8]: def f(x):
   ...:     global y
   ...:     y = x + y
   ...:     return y
   ...:
In [9]: df['first'].apply(lambda s: f(s))
Out[9]:
0 8
1 9
2 9
3 15
4 19
Name: first, dtype: int64
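For reference, the cumulative route mentioned above looks like this for this specific case. A minimal sketch, assuming f really is a plain running sum: cumsum offset by the seed reproduces the same output without any global state.

df['first'].cumsum() + 1  # 8, 9, 9, 15, 19 with the data above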
"I want to avoid looping and re-assigning the variables inside the loop"
Note, pd.DataFrame.apply is a vanilla Python loop under the hood, and it's actually less efficient because it does a lot of checking/validation of inputs. It is meant to be convenient, not efficient. So if you care about performance, you've already given up by relying on .apply.
Honestly, I think I would rather write the explicit loop over the rows inside a function than rely on global state.
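A minimal sketch of that explicit-loop alternative, using the original two-argument f from the question (apply_cumulative is a name I made up, not a pandas function): the running y stays local to the function, and the result keeps the original index.

def apply_cumulative(series, func, y):
    # Thread `y` through `func` row by row; all state stays local.
    out = []
    for x in series:
        y = func(x, y)
        out.append(y)
    return pd.Series(out, index=series.index)

apply_cumulative(df['first'], f, y=1)  # 8, 9, 9, 15, 19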
You could use a generator function to remember the prior calculation result:
def my_generator(series, foo, y_seed=0):
    y = y_seed  # Seed value for `y`.
    for x in series:  # Iterate over the series values.
        # Call the function on the current `x` together with the most recent `y`.
        y = foo(x=x, y=y)
        yield y
df = df.assign(new_col=list(my_generator(series=df['first'], foo=f, y_seed=1)))
>>> df
first second new_col
0 8 NaN 9
1 3 NaN 12
2 0 NaN 12
3 5 NaN 17
4 4 NaN 21
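For what it's worth, the standard library already implements this running-state pattern: itertools.accumulate threads the previous result into each call. A sketch under the same setup; the seed is prepended to the input and the first output (the seed itself) is dropped.

from itertools import accumulate

# accumulate feeds (previous_result, next_value) to the lambda at each step
results = list(accumulate([1, *df['first']], lambda y, x: f(x, y)))[1:]
df = df.assign(new_col=results)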
Related
I always assumed that the apply function won't change the original pandas DataFrame and that you need to assign the returned result to see the changes. However, could anyone help explain why the following happens?
def f(row):
    row['a'] = 10
    row['b'] = 20
df_x = pd.DataFrame({'a':[10,11,12], 'b':[3,4,5], 'c':[1,1,1]}) #, 'd':[[1,2],[1,2],[1,2]]
df_x.apply(f, axis = 1)
df_x
returns
a b c
0 10 20 1
1 10 20 1
2 10 20 1
So, the apply function changed the original pd.DataFrame without anything being returned or assigned. But if there's a non-basic-type column in the data frame, then it won't do anything:
def f(row):
    row['a'] = 10
    row['b'] = 20
    row['d'] = [0]
df_x = pd.DataFrame({'a':[10,11,12], 'b':[3,4,5], 'c':[1,1,1], 'd':[[1,2],[1,2],[1,2]]})
df_x.apply(f, axis = 1)
df_x
This returns the result without any change:
a b c d
0 10 3 1 [1, 2]
1 11 4 1 [1, 2]
2 12 5 1 [1, 2]
Could anyone help explain this or provide some reference? Thanks.
Series are mutable objects. If you modify them during an operation, the changes will be reflected if no copy is made.
This is what happens in the first case. My guess: no copy is made because your DataFrame has a homogeneous dtype (integer), so the whole DataFrame is stored internally as a single array.
In the second case, at least one item is a list. This makes the dtype object, so the DataFrame no longer has a single dtype, and apply must generate a new Series for each row because of the mixed types.
You can actually reproduce this just by changing a single element to another type:
def f(row):
    row['a'] = 10
    row['b'] = 20

df_x = pd.DataFrame({'a': [10, 11, 12],
                     'b': [3, 4, 5],
                     'c': [1, 1., 1]})  # float
df_x.apply(f, axis = 1)
df_x
# different types
# no mutation
a b c
0 10 3 1.0
1 11 4 1.0
2 12 5 1.0
Take-home message: never modify a mutable input inside a function (unless you really mean to and know what you're doing).
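To make the take-home message concrete, here is a minimal sketch of the safe pattern: build and return new values instead of mutating the row that apply hands you, then assign the result back explicitly.

def f(row):
    # Return a new Series instead of mutating `row` in place.
    return pd.Series({'a': 10, 'b': 20})

df_x = pd.DataFrame({'a': [10, 11, 12], 'b': [3, 4, 5], 'c': [1, 1, 1]})
df_x[['a', 'b']] = df_x.apply(f, axis=1)  # changes take effect via assignment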
This code gives the result I want: it takes the value at row n-1 and calculates the value for row n from it. Take the previous value in column 'y' (call it y[n-1]) and calculate a value which becomes the new y[n]; then, in the next row, take that new y as the previous value and calculate another new y, and so on.
size = 10
x = range(1, size + 1)
df = pd.DataFrame(data={'x': x, 'y': size})

for n in range(1, len(x)):
    df['y'].iloc[n] = df['y'].iloc[n-1] * 2
out:
x y
0 1 10
1 2 20
2 3 40
... ... ...
9 10 5120
I want to put it into a lambda but somehow fail to get it right:
b=2
df['y'].loc[1::] = df['y'].shift(-1).apply(lambda x: x*b)
out:
x y
0 1 10.0
1 2 20.0
2 3 20.0
... ... ...
The lambda function takes the pre-populated value (10) in each row instead of shifting one step back and taking the previous value as the base for the multiplication.
I looked at some threads, but it's above my comprehension: am I dealing with recursion here, and is that not possible with lambdas?
recursive lambda-expressions possible?
Python Recursion on Pandas df
Can a lambda function call itself recursively in Python?
Edit:
I want subsequent entries in 'y' to be calculated from previous 'y' entries, starting from idx 1.
DataFrame at start:
idx | y
0   | 10
DataFrame after 1st calc:
y1 = y0 * 2  # *2 is a placeholder; it could be m*x + b, or something else
idx | y
0   | 10
1   | 20
I'm not sure you need any recursion; unless I'm misunderstanding, this is just exponentiation. I'm not sure what your actual use case is, but something like one of these should work:
[v*(2**i) for i,v in enumerate(df.y)]
df.apply(lambda j: j.y*(2**(j.x-1)), axis=1)
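If the update really is multiplication by a constant, as in the question, the recurrence also has a direct vectorised form; a sketch, assuming the seed is the pre-populated value in row 0:

import numpy as np
import pandas as pd

size = 10
df = pd.DataFrame({'x': range(1, size + 1), 'y': size})
# y[n] = y[0] * 2**n, so the whole column can be filled at once
df['y'] = df['y'].iloc[0] * np.power(2, np.arange(size))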
So I have a dataframe that looks something like this:
df1 = pd.DataFrame([[1,2, 3], [5,7,8], [2,5,4]])
0 1 2
0 1 2 3
1 5 7 8
2 2 5 4
I then have a function called add5 that adds 5 to a number. I'm trying to create a new column in df1 that adds 5 to all the numbers in column 2 that are greater than 3. I want to use vectorisation, not apply, as this concept is going to be expanded to a dataset with hundreds of thousands of entries, and speed will be important. I can do it without the greater-than-3 constraint like this:
df1['3'] = add5(df1[2])
But my goal is to do something like this:
df1['3'] = add5(df1[2]) if df1[2] > 3
Hoping someone can point me in the right direction on this. Thanks!
With Pandas, a function applied explicitly to each row typically cannot be vectorised. Even implicit loops such as pd.Series.apply will likely be inefficient. Instead, you should use true vectorised operations, which lean heavily on NumPy in both functionality and syntax.
In this case, you can use numpy.where:
df1[3] = np.where(df1[2] > 3, df1[2] + 5, df1[2])
Alternatively, you can use pd.DataFrame.loc in a couple of steps:
df1[3] = df1[2]
df1.loc[df1[2] > 3, 3] = df1[2] + 5
In each case, the term df1[2] > 3 creates a Boolean series, which is then used to mask another series.
Result:
print(df1)
0 1 2 3
0 1 2 3 3
1 5 7 8 13
2 2 5 4 9
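If you want to keep add5 itself, pd.Series.where expresses the same masking; a sketch, assuming add5 is plain element-wise arithmetic on a Series:

def add5(s):
    return s + 5  # element-wise on a Series

# keep values where the condition holds, replace the rest with add5(...)
df1[3] = df1[2].where(df1[2] <= 3, add5(df1[2]))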
Consider the following data set stored in a pandas DataFrame dfX:
A B
1 2
4 6
7 9
I have a function that is:
def someThingSpecial(x, y):
    # z = do something special with x, y
    return z
I now want to create a new column in dfX that bears the computed z value.
Looking at other SO examples, I've tried several variants including:
dfX['C'] = dfX.apply(lambda x: someThingSpecial(x=x['A'], y=x['B']), axis=1)
Which returns errors. What is the right way to do this?
This seems to work for me on v0.21. Take a look -
df
A B
0 1 2
1 4 6
2 7 9
def someThingSpecial(x, y):
    return x + y
df.apply(lambda x: someThingSpecial(x.A, x.B), axis=1)
0 3
1 10
2 16
dtype: int64
You might want to try upgrading your pandas version to the latest stable release (0.21 as of now).
Here's another option. You can vectorise your function.
v = np.vectorize(someThingSpecial)
v now accepts arrays, but operates on each pair of elements individually. Note that this just hides the loop, as apply does, but is much cleaner. Now, you can compute C as so -
df['C'] = v(df.A, df.B)
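As a point of comparison: when the function body can be written with whole-column operations (as the toy someThingSpecial above can), plain column arithmetic is truly vectorised and avoids even the hidden loop.

df['C'] = df.A + df.B  # no per-row Python calls at all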
If your function only needs one column's value, then do this instead of coldspeed's answer:
dfX['A'].apply(your_func)
to store it:
dfX['C'] = dfX['A'].apply(your_func)
I have a very simple query.
I have a csv that looks like this:
ID X Y
1 10 3
2 20 23
3 21 34
And I want to add a new column called Z which is equal to 1 if X is equal to or bigger than Y, or 0 otherwise.
My code so far is:
import pandas as pd
data = pd.read_csv("XYZ.csv")
for x in data["X"]:
    if x >= data["Y"]:
        Data["Z"] = 1
    else:
        Data["Z"] = 0
You can do this without a loop by using ge, which means "greater than or equal to", and casting the boolean array to int using astype:
In [119]:
df['Z'] = (df['X'].ge(df['Y'])).astype(int)
df
Out[119]:
ID X Y Z
0 1 10 3 1
1 2 20 23 0
2 3 21 34 0
Regarding your attempt:
for x in data["X"]:
    if x >= data["Y"]:
        Data["Z"] = 1
    else:
        Data["Z"] = 0
it wouldn't work. Firstly, you're using Data rather than data. Secondly, even with that fixed, you'd be comparing a scalar against an array, which raises an error because such a comparison is ambiguous. Thirdly, you're assigning the entire column on each iteration, overwriting it every time.
You need to access the index label, which your loop didn't; you can use iteritems (renamed to items in recent pandas) to do this:
In [125]:
for idx, x in df["X"].iteritems():
    if x >= df['Y'].loc[idx]:
        df.loc[idx, 'Z'] = 1
    else:
        df.loc[idx, 'Z'] = 0
df
Out[125]:
ID X Y Z
0 1 10 3 1
1 2 20 23 0
2 3 21 34 0
But really this is unnecessary, as there is a vectorised method available.
Firstly, your code is just fine. You simply capitalized your dataframe name as 'Data' instead of making it 'data'.
However, for efficient code, EdChum has a great answer above. Alternatively, here is another vectorised method that is easy to remember:
import numpy as np
data['Z'] = np.where(data.X >= data.Y, 1, 0)