I have a function called calculate_distance, which takes 4 Pandas cells as input and returns a new value that I want to assign to a specific Pandas cell.
The 4 input values change dynamically as seen in the code below.
df['distance'] = ''
for i in range(1, df.shape[0]):
    df.at[i, 'distance'] = calculate_distance(df['latitude'].iloc[i-1], df['longitude'].iloc[i-1],
                                              df['latitude'].iloc[i], df['longitude'].iloc[i])
Is there a faster way to do it than this "newbie" for loop?
You could use
df['distance'] = df.apply(your_calculate_distance_def, axis=1)
It's faster than a loop. I don't know what your function does, but apply should help you boost speed.
You may refer to the pandas apply documentation for more help - pandas.DataFrame.apply
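Note that apply with axis=1 only sees one row at a time, so by itself it can't reach row i-1. If calculate_distance accepts whole Series, a shift()-based alternative avoids the loop entirely. A sketch with made-up coordinates and a placeholder distance function (the real one would be something like a haversine formula):

```python
import numpy as np
import pandas as pd

# hypothetical sample coordinates standing in for the real GPS trace
df = pd.DataFrame({'latitude':  [40.0, 40.1, 40.2],
                   'longitude': [-74.0, -74.1, -74.3]})

def calculate_distance(lat1, lon1, lat2, lon2):
    # placeholder body: Euclidean distance in degrees
    return np.sqrt((lat2 - lat1) ** 2 + (lon2 - lon1) ** 2)

# shift(1) lines each row up with the previous row's coordinates,
# so the whole column is computed in one vectorized call
prev = df[['latitude', 'longitude']].shift(1)
df['distance'] = calculate_distance(prev['latitude'], prev['longitude'],
                                    df['latitude'], df['longitude'])
```

The first row comes out NaN, matching the loop that starts at index 1.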
I like to think every design decision is made for a reason. A lot of pandas functions (e.g. df.drop, df.rename, df.replace) come with a parameter, inplace. If you set it to True, instead of returning a new dataframe, pandas modifies the dataframe, well, in place. No surprises here ;).
However, I often find myself using df.apply in combination with a lambda expression to do somewhat more complex operations on columns. Consider the following example:
Say I have text data that needs to be pre-processed for sentiment analysis. I would use:
import string

def remove_punctuation(text):
    no_punct = "".join([c for c in text if c not in string.punctuation])
    return no_punct
And then adapt my column as follows:
df['text'] = df['text'].apply(lambda x: remove_punctuation(x))
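For reference, that round trip runs end-to-end like this (the sample strings are made up for illustration):

```python
import string

import pandas as pd

def remove_punctuation(text):
    return "".join([c for c in text if c not in string.punctuation])

# made-up sample text, just to show the transformation
df = pd.DataFrame({'text': ["Hello, world!", "Already clean"]})
df['text'] = df['text'].apply(remove_punctuation)
print(df['text'].tolist())  # ['Hello world', 'Already clean']
```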
I recently noticed that .apply does not have an argument inplace=True. Since this function is mostly used to update dataframes, why is such an argument not available? What would be a rationale behind this?
pandas.DataFrame.apply and pandas.Series.apply both return a new object built from a DataFrame or a Series. In your example you apply it to a Series, and inplace might make sense there. However, there are other applications where it wouldn't.
For example, with df being:
   col1  col2
0     1     3
1     2     4
Doing:
s = df.apply(lambda x: x.col1 + x.col2, axis=1)
This would return a Series whose type and shape differ from those of the original DataFrame. In this case an inplace argument wouldn't make much sense.
I think pandas devs wanted to enforce consistency between pandas.DataFrame.apply and pandas.Series.apply, avoiding confusion generated by having an inplace argument in pandas.Series.apply only.
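A minimal reproduction of that shape change, using the two-column frame above:

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})

# row-wise apply collapses each row to a scalar, so the result is a
# Series of length 2 -- a different type and shape than the DataFrame
s = df.apply(lambda x: x.col1 + x.col2, axis=1)
print(type(s).__name__, s.tolist())  # Series [4, 6]
```

There is no place inside the original DataFrame where this result could be written "in place", which is the point of the answer.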
I have a dataframe that has 100 variables, var1--var100. I want to bring var40, var20, and var30 to the front, with the other variables remaining in their original order. I've searched online; methods like
1: df[['var40', 'var20', 'var30', 'var1', ...]]
2: columns = ['var40', 'var20', 'var30', 'var1', ...]
all require specifying all the variables in the dataframe. With 100 variables in my dataframe, how can I do it efficiently?
I am a SAS user; in SAS, we can use a retain statement before the set statement to achieve this goal. Is there an equivalent way in Python too?
Thanks
Consider reindex with a conditional list comprehension:
first_cols = ['var30', 'var40', 'var20']
df = df.reindex(first_cols + [col for col in df.columns if col not in first_cols],
                axis='columns')
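For instance, with a small frame standing in for the 100-column one (column names here are illustrative):

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4, 5]],
                  columns=['var1', 'var2', 'var20', 'var30', 'var40'])

first_cols = ['var40', 'var20', 'var30']
# everything not in first_cols keeps its original relative order
df = df.reindex(first_cols + [c for c in df.columns if c not in first_cols],
                axis='columns')
print(list(df.columns))  # ['var40', 'var20', 'var30', 'var1', 'var2']
```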
I have a dataframe with observations possessing a number of codes. I want to compare the codes present in a row with a list. If any codes are in that list, I wish to flag the row. I can accomplish this using the itertuples method as follows:
import pandas as pd
import numpy as np
df = pd.DataFrame({'id':  [1, 2, 3, 4, 5],
                   'cd1': ['abc1', 'abc2', 'abc3', 'abc4', 'abc5'],
                   'cd2': ['abc3', 'abc4', 'abc5', 'abc6', ''],
                   'cd3': ['abc10', '', '', '', '']})
code_flags = ['abc1','abc6']
# initialize flag column
df['flag'] = 0
# itertuples method
for row in df.itertuples():
    if any(df.iloc[row.Index, 1:4].isin(code_flags)):
        df.at[row.Index, 'flag'] = 1
The output correctly adds a flag column with the appropriate flags, where 1 indicates a flagged entry.
However, on my actual use case, this takes hours to complete. I have attempted to vectorize this approach using numpy.where.
df['flag'] = 0 # reset
df['flag'] = np.where(any(df.iloc[:,1:4].isin(code_flags)),1,0)
This appears to evaluate everything as True. I think I'm confused about how the vectorization treats the index: I can remove the :, and write df.iloc[1:4] and obtain the same result.
Am I misunderstanding the where function? Is my indexing incorrect and causing a True evaluation for all cases? Is there a better way to do this?
Use np.where with the DataFrame method .any(axis=1), not the built-in any(...):
np.where(df.iloc[:, 1:4].isin(code_flags).any(axis=1), 1, 0)
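Putting that together with the sample frame from the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'id':  [1, 2, 3, 4, 5],
                   'cd1': ['abc1', 'abc2', 'abc3', 'abc4', 'abc5'],
                   'cd2': ['abc3', 'abc4', 'abc5', 'abc6', ''],
                   'cd3': ['abc10', '', '', '', '']})
code_flags = ['abc1', 'abc6']

# .any(axis=1) reduces the boolean frame from isin() row-wise;
# the built-in any() would iterate over column labels instead
df['flag'] = np.where(df.iloc[:, 1:4].isin(code_flags).any(axis=1), 1, 0)
print(df['flag'].tolist())  # [1, 0, 0, 1, 0]
```

Rows 1 and 4 are flagged because they contain 'abc1' and 'abc6' respectively.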
I'm trying to apply to a dataframe a function that has more than one argument, of which two need to be assigned to the dataframe's rows, and one is a variable (a simple number).
A variation from a similar thread works for the rows: (all functions are oversimplified compared to my original ones)
import pandas as pd
d = {'a': [-2, 5, 4, -6], 'b': [4, 4, 5, -8]}
df = pd.DataFrame(d)
print(df)

def DummyFunction(row):
    return row['a'] * row['b']

# this works:
df['Dummy1'] = df.apply(DummyFunction, axis=1)
But how can I apply the following variation, where my function takes in an additional argument (a fixed variable)? I seem to find no way to pass it inside the apply method:
def DummyFunction2(row, threshold):
    return row['a'] * row['b'] * threshold

# where threshold will be assigned to a number?
# I don't seem to find a viable option to fill the row argument below:
# df['Dummy2'] = df.apply(DummyFunction2(row, 1000), axis=1)
Thanks for your help!
You can pass the additional variable directly as a named argument to pd.DataFrame.apply:
def DummyFunction2(row, threshold):
    return row['a'] * row['b'] * threshold

df['Dummy2'] = df.apply(DummyFunction2, threshold=2, axis=1)
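As a runnable sketch (threshold=2 chosen arbitrarily); positional extras can also be passed through the args tuple:

```python
import pandas as pd

df = pd.DataFrame({'a': [-2, 5, 4, -6], 'b': [4, 4, 5, -8]})

def DummyFunction2(row, threshold):
    return row['a'] * row['b'] * threshold

# extra keyword arguments after the function are forwarded to it...
df['Dummy2'] = df.apply(DummyFunction2, threshold=2, axis=1)

# ...and positional extras can go through the args tuple instead
df['Dummy3'] = df.apply(DummyFunction2, args=(2,), axis=1)

print(df['Dummy2'].tolist())  # [-16, 40, 40, 96]
```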
What is the best way to do iterrows with a subset of a DataFrame?
Let's take the following simple example:
import datetime as DT
import pandas as pd

df = pd.DataFrame({
    'Product': list('AAAABBAA'),
    'Quantity': [5, 2, 5, 10, 1, 5, 2, 3],
    'Start': [
        DT.datetime(2013, 1, 1, 9, 0),
        DT.datetime(2013, 1, 1, 8, 5),
        DT.datetime(2013, 2, 5, 14, 0),
        DT.datetime(2013, 2, 5, 16, 0),
        DT.datetime(2013, 2, 8, 20, 0),
        DT.datetime(2013, 2, 8, 16, 50),
        DT.datetime(2013, 2, 8, 7, 0),
        DT.datetime(2013, 7, 4, 8, 0)]})
df = df.set_index(['Start'])
Now I would like to modify a subset of this DataFrame using the iterrows function, e.g.:
for i, row_i in df[df.Product == 'A'].iterrows():
    row_i['Product'] = 'A1'  # actually a more complex calculation
However, the changes do not persist.
Is there any possibility (except a manual lookup using the index 'i') to make persistent changes to the original DataFrame?
Why do you need iterrows() for this? I think it's always preferable to use vectorized operations in pandas (or numpy):
df.loc[df['Product'] == 'A', 'Product'] = 'A1'
(The old .ix indexer has been removed from pandas; .loc does the same boolean-mask assignment.)
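A quick sketch of that masked assignment on a trimmed-down frame (.loc replaces the long-removed .ix indexer in current pandas):

```python
import pandas as pd

df = pd.DataFrame({'Product': list('AAB'), 'Quantity': [5, 2, 1]})

# boolean mask selects the rows; .loc assigns on the original frame in place
df.loc[df['Product'] == 'A', 'Product'] = 'A1'
print(df['Product'].tolist())  # ['A1', 'A1', 'B']
```

Unlike iterrows(), which hands back copies of the rows, this writes straight into df.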
I guess the best way that comes to my mind is to generate a new vector with the desired result, where you can loop all you want, and then reassign it back to the column:
# make a copy of the column
P = df['Product'].copy()
# do the operation, or loop if you really must
P[P == 'A'] = 'A1'
# reassign to the original df
df['Product'] = P