Can apply function change the original input pandas df? - python

I always assumed that the apply function does not change the original pandas DataFrame and that the result has to be assigned back to keep any changes. Could anyone explain why the following happens?
import pandas as pd

def f(row):
    row['a'] = 10
    row['b'] = 20

df_x = pd.DataFrame({'a':[10,11,12], 'b':[3,4,5], 'c':[1,1,1]}) #, 'd':[[1,2],[1,2],[1,2]]
df_x.apply(f, axis = 1)
df_x
returns
    a   b  c
0  10  20  1
1  10  20  1
2  10  20  1
So the apply function changed the original pd.DataFrame even though nothing was returned. But if there is a column of a non-basic type in the data frame, it does not change anything:
def f(row):
    row['a'] = 10
    row['b'] = 20
    row['d'] = [0]

df_x = pd.DataFrame({'a':[10,11,12], 'b':[3,4,5], 'c':[1,1,1], 'd':[[1,2],[1,2],[1,2]]})
df_x.apply(f, axis = 1)
df_x
This returns the DataFrame without any change:
    a  b  c       d
0  10  3  1  [1, 2]
1  11  4  1  [1, 2]
2  12  5  1  [1, 2]
Could anyone explain this or point me to a reference? Thanks.

Series are mutable objects. If you modify them during an operation, the changes will be reflected if no copy is made.
This is what happens in the first case. My guess: no copy is made because your DataFrame has a homogeneous dtype (integer), so the whole DataFrame is stored internally as a single array. This is an implementation detail and may differ between pandas versions.
In the second case, at least one item is a list. This makes that column's dtype object, so the DataFrame no longer has a single dtype, and apply must build a new Series for each row before calling the function because of the mixed types in the row.
You can actually reproduce this just by changing a single element to another type:
def f(row):
    row['a'] = 10
    row['b'] = 20

df_x = pd.DataFrame({'a':[10,11,12],
                     'b':[3,4,5],
                     'c':[1,1.,1]}) # float
df_x.apply(f, axis = 1)
df_x
# different types
# no mutation
    a  b    c
0  10  3  1.0
1  11  4  1.0
2  12  5  1.0
Take home message: never modify a mutable input in a function (unless you want it and know what you're doing).
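As a minimal sketch of that advice (not part of the original answer): the function can work on a copy of the row and return it, and the caller assigns the result. The names `result` and `assigned` are only illustrative. For constant values like these, `DataFrame.assign` does the same job without `apply`.

```python
import pandas as pd

def f(row):
    # Work on a copy so the input row is never mutated,
    # whatever pandas does internally.
    row = row.copy()
    row['a'] = 10
    row['b'] = 20
    return row

df_x = pd.DataFrame({'a': [10, 11, 12], 'b': [3, 4, 5], 'c': [1, 1.0, 1]})

# The modified rows come back as a new DataFrame; df_x is left alone.
result = df_x.apply(f, axis=1)

# For constant values, assign() builds the new frame directly.
assigned = df_x.assign(a=10, b=20)
```

Either way the original `df_x` keeps its values, and the behaviour no longer depends on whether the columns happen to share a dtype.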

Related

Python: how to multiply 2 columns?

I have a simple dataframe and I would like to add the column 'Pow_calkowita'. If 'liczba_kon' is 0, 'Pow_calkowita' is 'Powierzchn', but if 'liczba_kon' is not 0, 'Pow_calkowita' is 'liczba_kon' * 'Powierzchn'. Why can't I do that?
for index, row in df.iterrows():
    if row['liczba_kon'] == 0:
        row['Pow_calkowita'] = row['Powierzchn']
    elif row['liczba_kon'] != 0:
        row['Pow_calkowita'] = row['Powierzchn'] * row['liczba_kon']
My code didn't return any values.
liczba_kon Powierzchn
0 3 69.60495
1 1 39.27270
2 1 130.41225
3 1 129.29570
4 1 294.94400
5 1 64.79345
6 1 108.75560
7 1 35.12290
8 1 178.23905
9 1 263.00930
10 1 32.02235
11 1 125.41480
12 1 47.05420
13 1 45.97135
14 1 154.87120
15 1 37.17370
16 1 37.80705
17 1 38.78760
18 1 35.50065
19 1 74.68940
I have found a solution:
result = []
for index, row in df.iterrows():
    if row['liczba_kon'] == 0:
        result.append(row['Powierzchn'])
    elif row['liczba_kon'] != 0:
        result.append(row['Powierzchn'] * row['liczba_kon'])
df['Pow_calkowita'] = result
Is this a good way?
To write idiomatic pandas code and benefit from its efficient array processing, avoid looping over the rows yourself. Pandas lets you write succinct code that still runs efficiently, because operations are vectorized over NumPy ndarrays and executed by optimized compiled C code. Pandas handles the necessary looping behind the scenes, so a single statement can replace an explicit loop over all elements.
As your formula is based on a condition, you cannot use direct multiplication. Instead, you can use np.where() as follows:
import numpy as np
df['Pow_calkowita'] = np.where(df['liczba_kon'] == 0, df['Powierzchn'], df['Powierzchn'] * df['liczba_kon'])
When the test condition in first parameter is true, the value from second parameter is taken, else, the value from the third parameter is taken.
Test run output (with two extra test rows added at the end, one of them with liczba_kon equal to 0):
print(df)
liczba_kon Powierzchn Pow_calkowita
0 3 69.60495 208.81485
1 1 39.27270 39.27270
2 1 130.41225 130.41225
3 1 129.29570 129.29570
4 1 294.94400 294.94400
5 1 64.79345 64.79345
6 1 108.75560 108.75560
7 1 35.12290 35.12290
8 1 178.23905 178.23905
9 1 263.00930 263.00930
10 1 32.02235 32.02235
11 1 125.41480 125.41480
12 1 47.05420 47.05420
13 1 45.97135 45.97135
14 1 154.87120 154.87120
15 1 37.17370 37.17370
16 1 37.80705 37.80705
17 1 38.78760 38.78760
18 1 35.50065 35.50065
19 1 74.68940 74.68940
20 0 69.60495 69.60495
21 2 74.68940 149.37880
To answer the first question: "Why can't I do that?"
The documentation states (in the notes):
Because iterrows returns a Series for each row, ....
and
You should never modify something you are iterating over. [...] the iterator returns a copy and not a view, and writing to it will have no effect.
This basically means that it returns a new Series holding the values of that row.
So, what you are getting is NOT the actual row, and definitely NOT the dataframe!
BUT what you are doing is working, although not in the way that you want to:
>>> df = pd.DataFrame(dict(a=[1, 2, 3], b=list("abc")))
>>> df # To demonstrate what you are doing
   a  b
0  1  a
1  2  b
2  3  c
>>> for index, row in df.iterrows():
...     print("\n------------------\n>>> Next Row:\n")
...     print(row)
...     row["c"] = "ADDED" ####### HERE I am adding to 'the row'
...     print("\n -- >> added:")
...     print(row)
...     print("----------------------")
...
------------------
Next Row: # as you can see, this Series has the same values
a 1 # as the row that it represents
b a
Name: 0, dtype: object
-- >> added:
a 1
b a
c ADDED # and adding to it works... but you aren't doing anything
Name: 0, dtype: object # with it, unless you append it to a list
----------------------
------------------
Next Row:
a 2
b b
Name: 1, dtype: object
### same here
-- >> added:
a 2
b b
c ADDED
Name: 1, dtype: object
----------------------
------------------
Next Row:
a 3
b c
Name: 2, dtype: object
### and here
-- >> added:
a 3
b c
c ADDED
Name: 2, dtype: object
----------------------
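To make the point concrete, here is a small sketch (my own example data, not the asker's full table) showing that writing to the row from iterrows leaves the DataFrame untouched, while writing through df.loc with the index label does reach it. The `factor` helper variable is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"liczba_kon": [3, 1, 0], "Powierzchn": [69.5, 39.25, 10.0]})

# Writing to `row` only changes the temporary Series built for this iteration.
for index, row in df.iterrows():
    row["Pow_calkowita"] = row["Powierzchn"]

column_was_added = "Pow_calkowita" in df.columns  # stays False

# Writing through df.loc with the index label changes the DataFrame itself.
for index in df.index:
    count = df.loc[index, "liczba_kon"]
    factor = count if count != 0 else 1
    df.loc[index, "Pow_calkowita"] = df.loc[index, "Powierzchn"] * factor
```

This is still a row-by-row loop, so it only illustrates where the write has to go; it is not a recommendation over the vectorized approach.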
To answer the second question: "Is this a good way?"
No.
Using the multiplication as SeaBean has shown takes advantage of the vectorized operations that numpy and pandas provide.
This is a link to a good article on vectorization in numpy arrays, which are basically the building blocks of pandas DataFrames and Series.
A DataFrame is designed to work with vectorization. You can treat it like a database table, so you should use its built-in functions whenever possible.
tdf = df.copy() # temp df (a real copy; plain `tdf = df` would also overwrite the zeros in df)
tdf['liczba_kon'] = tdf['liczba_kon'].replace(0, 1) # replace 0 with 1
tdf['Pow_calkowita'] = tdf['liczba_kon'] * tdf['Powierzchn'] # multiply
df['Pow_calkowita'] = tdf['Pow_calkowita'] # copy column
This simplifies the code and improves performance. We can compare the two approaches:
import time
import numpy as np
import pandas as pd

sampleSize = 100000
df=pd.DataFrame({
    'liczba_kon': np.random.randint(3, size=(sampleSize)),
    'Powierzchn': np.random.randint(1000, size=(sampleSize)),
})
# vectorization
s = time.time()
tdf = df.copy() # temp df (a real copy, so df keeps its zeros for the second test)
tdf['liczba_kon'] = tdf['liczba_kon'].replace(0, 1) # replace 0 with 1
tdf['Pow_calkowita'] = tdf['liczba_kon'] * tdf['Powierzchn'] # multiply
df['Pow_calkowita'] = tdf['Pow_calkowita'] # copy column
print(time.time() - s)
# iteration
s = time.time()
result = []
for index, row in df.iterrows():
    if row['liczba_kon'] == 0:
        result.append(row['Powierzchn'])
    elif row['liczba_kon'] != 0:
        result.append(row['Powierzchn'] * row['liczba_kon'])
df['Pow_calkowita'] = result
print(time.time() - s)
We can see that vectorization performed much faster.
0.0034716129302978516
6.193516492843628
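The replace-then-multiply idea can also be written without a temporary frame. This is just a sketch on made-up values; `multiplier` is a name I chose for illustration. Since `Series.replace` returns a new Series, the original liczba_kon column keeps its zeros.

```python
import pandas as pd

df = pd.DataFrame({"liczba_kon": [3, 1, 0], "Powierzchn": [2.0, 4.0, 5.0]})

# replace() returns a new Series, so df['liczba_kon'] itself is not modified.
multiplier = df["liczba_kon"].replace(0, 1)
df["Pow_calkowita"] = df["Powierzchn"] * multiplier
```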

Pandas dataframe: creating a new column that is a custom function using 2 other columns

Consider the following data set stored in a pandas DataFrame dfX:
A B
1 2
4 6
7 9
I have a function that is:
def someThingSpecial(x,y):
    # z = do something special with x,y
    return z
I now want to create a new column in dfX that holds the computed z value.
Looking at other SO examples, I've tried several variants including:
dfX['C'] = dfX.apply(lambda x: someThingSpecial(x=x['A'], y=x['B']), axis=1)
Which returns errors. What is the right way to do this?
This seems to work for me on v0.21. Take a look -
df
A B
0 1 2
1 4 6
2 7 9
def someThingSpecial(x,y):
    return x + y
df.apply(lambda x: someThingSpecial(x.A, x.B), 1)
0 3
1 10
2 16
dtype: int64
You might want to try upgrading your pandas version to the latest stable release (0.21 as of now).
Here's another option. You can vectorise your function.
v = np.vectorize(someThingSpecial)
v now accepts arrays, but operates on each pair of elements individually. Note that this just hides the loop, as apply does, but is much cleaner. Now, you can compute C as so -
df['C'] = v(df.A, df.B)
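For comparison, here is a self-contained sketch of the options on the sample data. It assumes, as in the answer above, that the function is a plain sum, in which case it can also be called on the columns directly, since arithmetic on Series is already element-wise. The snake_case function name and the variable names are mine.

```python
import numpy as np
import pandas as pd

def some_thing_special(x, y):
    # Stand-in for the real computation; assumed to be a simple sum here.
    return x + y

df = pd.DataFrame({"A": [1, 4, 7], "B": [2, 6, 9]})

# Row-wise apply: calls the function once per row.
via_apply = df.apply(lambda row: some_thing_special(row["A"], row["B"]), axis=1)

# np.vectorize: also calls the function once per pair, but reads more cleanly.
via_vectorize = np.vectorize(some_thing_special)(df["A"], df["B"])

# Direct call on whole columns: works when the function only uses
# operations that Series support element-wise.
direct = some_thing_special(df["A"], df["B"])

df["C"] = direct
```

The direct call is only possible when the function body is made of array-friendly operations. For arbitrary Python logic, apply or np.vectorize remain the fallback.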
If your function only needs one column's value, then do this instead of coldspeed's answer:
dfX['A'].apply(your_func)
to store it:
dfX['C'] = dfX['A'].apply(your_func)

Applying lambda functions with updated values

Let us assume we are given the below function:
def f(x,y):
    y = x + y
    return y
The function f(x,y) sums two numbers (but it could be any more or less complicated functions of two arguments). Let us now consider the following
import pandas as pd
import random
import numpy as np
random.seed(1234)
df = pd.DataFrame({'first': random.sample(range(0, 9), 5),
                   'second': np.nan}, index = None)
y = 1
df
first second
0 7 NaN
1 1 NaN
2 0 NaN
3 6 NaN
4 4 NaN
For the scope of the question the second column of the data frame is irrelevant, so we can assume without loss of generality that it is NaN. Let us apply f(x,y) to each row of the data frame, with the variable y initialised to 1. The first iteration returns 7+1 = 8. When the function is applied to the second row, we want y to be updated to the previously calculated 8, so that the result is 1+8 = 9, and so on.
What is the pythonic way to handle this? I want to avoid looping and re-assigning the variables inside the loop, thus my guess would be something along the lines of
def apply_to_df(df, y):
    result = df['first'].apply(lambda s: f(s,y))
    return result
However, one can easily see that the above does not use the updated values; it performs every calculation with the initial value y=1.
print(apply_to_df(df,y))
0 8
1 2
2 1
3 7
4 5
Note, you can probably solve this specific case with an existing cumulative function. However, in the general case, you could just hack it by relying on global state:
In [7]: y = 1
In [8]: def f(x):
   ...:     global y
   ...:     y = x + y
   ...:     return y
   ...:
In [9]: df['first'].apply(lambda s: f(s))
Out[9]:
0 8
1 9
2 9
3 15
4 19
Name: first, dtype: int64
"I want to avoid looping and re-assigning the variables inside the loop"
Note, pd.DataFrame.apply is a vanilla Python loop under the hood, and it's actually less efficient because it does a lot of checking/validation of inputs. It is not meant to be efficient, but convenient. So if you care about performance, you've already given up if you are relying on .apply.
Honestly, I think I would rather write the explicit loop over the rows inside of a function than rely on global state.
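A minimal sketch of what that explicit loop could look like, using the values from the question. `running_apply` is a name I made up for illustration. The second variant uses `itertools.accumulate` from the standard library (the `initial` argument needs Python 3.8+), which carries the previous result forward in the same way.

```python
from itertools import accumulate

import pandas as pd

def f(x, y):
    return x + y

def running_apply(series, func, y_seed):
    """Apply func(x, y) along the series, feeding each result back in as y."""
    results = []
    y = y_seed
    for x in series:
        y = func(x, y)
        results.append(y)
    return pd.Series(results, index=series.index)

s = pd.Series([7, 1, 0, 6, 4], name="first")
out = running_apply(s, f, y_seed=1)

# Same idea with itertools.accumulate; the first element is the seed itself,
# so it is dropped with [1:].
acc = list(accumulate(s, lambda y, x: f(x, y), initial=1))[1:]
```

The state lives in a local variable rather than a global, so the function can be called repeatedly with different seeds without side effects.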
You could use a generator function to remember the prior calculation result:
def my_generator(series, foo, y_seed=0):
    y = y_seed # Seed value for `y`.
    for x in series: # Iterate over the values of the series.
        # Call the function on the next `x` value together with the most recent `y` value.
        y = foo(x=x, y=y)
        yield y
df = df.assign(new_col=list(my_generator(series=df['first'], foo=f, y_seed=1)))
>>> df
first second new_col
0 8 NaN 9
1 3 NaN 12
2 0 NaN 12
3 5 NaN 17
4 4 NaN 21

How to add a new column to a table formed from conditional statements?

I have a very simple query.
I have a csv that looks like this:
ID X Y
1 10 3
2 20 23
3 21 34
And I want to add a new column called Z which is equal to 1 if X is equal to or bigger than Y, or 0 otherwise.
My code so far is:
import pandas as pd
data = pd.read_csv("XYZ.csv")
for x in data["X"]:
    if x >= data["Y"]:
        Data["Z"] = 1
    else:
        Data["Z"] = 0
You can do this without a loop by using ge, which means greater than or equal to, and casting the boolean result to int using astype:
In [119]:
df['Z'] = (df['X'].ge(df['Y'])).astype(int)
df
Out[119]:
ID X Y Z
0 1 10 3 1
1 2 20 23 0
2 3 21 34 0
Regarding your attempt:
for x in data["X"]:
    if x >= data["Y"]:
        Data["Z"] = 1
    else:
        Data["Z"] = 0
it wouldn't work. Firstly, you're using Data, not data. Secondly, even with that fixed, you'd be comparing a scalar against a whole column, so the if statement would raise a ValueError because the truth value of the resulting boolean Series is ambiguous. Thirdly, you're assigning to the entire column, so each iteration overwrites the whole column.
You need to access the index label, which your loop didn't. You can use items to do this (it was called iteritems in older pandas versions and was removed in pandas 2.0):
In [125]:
for idx, x in df["X"].items():
    if x >= df['Y'].loc[idx]:
        df.loc[idx, 'Z'] = 1
    else:
        df.loc[idx, 'Z'] = 0
df
Out[125]:
ID X Y Z
0 1 10 3 1
1 2 20 23 0
2 3 21 34 0
But really this is unnecessary as there is a vectorised method here
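Here is a self-contained sketch that reproduces both points on the sample data: the scalar-versus-column comparison inside an if raises a ValueError, while the vectorised version produces the expected column. The frame is built inline instead of read from XYZ.csv, and `error_message` is just a helper variable for the demonstration.

```python
import pandas as pd

data = pd.DataFrame({"ID": [1, 2, 3], "X": [10, 20, 21], "Y": [3, 23, 34]})

# Comparing one scalar with a whole column yields a boolean Series,
# and `if` cannot reduce that to a single True/False.
error_message = None
try:
    if 10 >= data["Y"]:
        pass
except ValueError as exc:
    error_message = str(exc)

# Vectorised version: element-wise comparison, then cast bool -> int.
data["Z"] = data["X"].ge(data["Y"]).astype(int)
```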
Firstly, note that you capitalized your dataframe name as 'Data' instead of 'data'; as the answer above explains, fixing that alone is not enough for the loop to work.
However, for efficient code, EdChum has a great answer above. Another vectorized method, with code that is perhaps easier to remember:
import numpy as np
data['Z'] = np.where(data.X >= data.Y, 1, 0)

Python dataframe check if a value in a column dataframe is within a range of values reported in another dataframe

Apologies if the problem is trivial, but as a Python newbie I wasn't able to find the right solution.
I have two dataframes and I need to add a column to the first dataframe that is True if a certain value of the first dataframe is between two values of the second dataframe, and False otherwise.
for example:
first_df = pd.DataFrame({'code1':[1,1,2,2,3,1,1],'code2':[10,22,15,15,7,130,2]})
second_df = pd.DataFrame({'code1':[1,1,2,2,3,1,1],'code2_start':[5,20,11,11,5,110,220],'code2_end':[15,25,20,20,10,120,230]})
first_df
code1 code2
0 1 10
1 1 22
2 2 15
3 2 15
4 3 7
5 1 130
6 1 2
second_df
code1 code2_end code2_start
0 1 15 5
1 1 25 20
2 2 20 11
3 2 20 11
4 3 10 5
5 1 120 110
6 1 230 220
For each row in the first dataframe I should check whether the value in the code2 column falls within one of the ranges identified by the rows of the second dataframe second_df. For example:
in row 1 of first_df code1=1 and code2=22
checking second_df I have 4 rows with code1=1, rows 0, 1, 5 and 6; the value code2=22 is in the interval identified by code2_start=20 and code2_end=25, so the function should return True.
Considering an example where the function should return False,
in row 5 of first_df code1=1 and code2=130
but there is no interval containing 130 where code1=1
I have tried to use this function
def check(first_df,second_df):
    for i in range(len(first_df)):
        return ((second_df.code2_start <= first_df.code2[i]) & (second_df.code2_end <= first_df.code2[i]) & (second_df.code1 == first_df.code1[i])).any()
and to vectorize it
first_df['output'] = np.vectorize(check)(first_df, second_df)
but obviously with no success.
I would be happy for any input you could provide.
thx.
A.
As a practical example:
first_df.code1[0] = 1
therefore I need to search second_df for all the instances where
second_df.code1 == first_df.code1[0]
0 True
1 True
2 False
3 False
4 False
5 True
6 True
for the instances 0,1,5,6 where the status is True I need to check if the value
first_df.code2[0]
10
is between one of the range identified by
second_df[second_df.code1 == first_df.code1[0]][['code2_start','code2_end']]
code2_start code2_end
0 5 15
1 20 25
5 110 120
6 220 230
Since the value of first_df.code2[0] is 10, it is between 5 and 15, i.e. in the range identified by row 0, so my function should return True. In the case of first_df.code1[6] the value would still be 1, so the range table would be the same as above, but first_df.code2[6] is 2 and there is no interval containing 2, so the result should be False.
first_df['output'] = (second_df.code2_start <= first_df.code2) & (second_df.code2_end >= first_df.code2)
This works because when you do something like: second_df.code2_start <= first_df.code2
You get a boolean Series. If you then perform a logical AND on two of these boolean series, you get a Series which has value True where both Series were True and False otherwise.
Here's an example:
>>> import pandas as pd
>>> a = pd.DataFrame([{1:2,2:4,3:6},{1:3,2:6,3:9},{1:4,2:8,3:10}])
>>> a['output'] = (a[2] <= a[3]) & (a[2] >= a[1])
>>> a
1 2 3 output
0 2 4 6 True
1 3 6 9 True
2 4 8 10 True
EDIT:
So based on your updated question and my new interpretation of your problem, I would do something like this:
import pandas as pd
# Define some data to work with
df_1 = pd.DataFrame([{'c1':1,'c2':5},{'c1':1,'c2':10},{'c1':1,'c2':20},{'c1':2,'c2':8}])
df_2 = pd.DataFrame([{'c1':1,'start':3,'end':6},{'c1':1,'start':7,'end':15},{'c1':2,'start':5,'end':15}])
# Function checks if c2 value is within any range matching c1 value
def checkRange(x, code_range):
    idx = code_range.c1 == x.c1
    code_range = code_range.loc[idx]
    check = (code_range.start <= x.c2) & (code_range.end >= x.c2)
    return check.any()
# Apply the checkRange function to each row of the DataFrame
df_1['output'] = df_1.apply(lambda x: checkRange(x, df_2), axis=1)
What I do here is define a function called checkRange which takes as input x, a single row of df_1 and code_range, the entire df_2 DataFrame. It first finds the rows of code_range which have the same c1 value as the given row, x.c1. Then the non matching rows are discarded. This is done in the first 2 lines:
idx = code_range.c1 == x.c1
code_range = code_range.loc[idx]
Next, we get a boolean Series which tells us if x.c2 falls within any of the ranges given in the reduced code_range DataFrame:
check = (code_range.start <= x.c2) & (code_range.end >= x.c2)
Finally, since we only care that the x.c2 falls within one of the ranges, we return the value of check.any(). When we call any() on a boolean Series, it will return True if any of the values in the Series are True.
To call the checkRange function on each row of df_1, we can use apply(). I define a lambda expression in order to send the checkRange function the row as well as df_2. axis=1 means that the function will be called on each row (instead of each column) for the DataFrame.
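If the row-wise apply becomes slow on larger frames, one possible alternative (a sketch, not something from the answer above) is to join the two frames on c1, test every candidate range at once, and then reduce back to one value per original row. The column names `index` and `hit` and the variable `merged` are my own choices. Note that the join creates one row per matching (row, range) pair, so it trades memory for speed.

```python
import pandas as pd

df_1 = pd.DataFrame({"c1": [1, 1, 1, 2], "c2": [5, 10, 20, 8]})
df_2 = pd.DataFrame({"c1": [1, 1, 2], "start": [3, 7, 5], "end": [6, 15, 15]})

# Pair every row of df_1 with every range that has the same c1.
# reset_index() keeps the original row label in a column named "index".
merged = df_1.reset_index().merge(df_2, on="c1", how="left")

# Element-wise range test (inclusive on both ends); rows without any
# matching range have NaN bounds and therefore evaluate to False.
merged["hit"] = merged["c2"].between(merged["start"], merged["end"])

# A row of df_1 passes if any of its candidate ranges contains c2.
df_1["output"] = merged.groupby("index")["hit"].any()
```

On this sample data the result matches what checkRange returns row by row: only the row with c2 equal to 20 falls outside every range for its c1.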
