Fastest way to assign value to pandas cell - python

I have a function called calculate_distance, which takes 4 Pandas cells as input and returns a new value that I want to assign to a specific Pandas cell.
The 4 input values change dynamically, as seen in the code below:
df['distance'] = ''
for i in range(1, df.shape[0]):
    df.at[i, 'distance'] = calculate_distance(df['latitude'].iloc[i-1], df['longitude'].iloc[i-1],
                                              df['latitude'].iloc[i], df['longitude'].iloc[i])
Is there a faster way to do it than this "newbie" for loop?

You could use
df['distance'] = df.apply(your_calculate_distance_def, axis=1)
It's faster than an explicit loop. I don't know what your function does, but apply should help you boost speed. Note that since apply with axis=1 only sees one row at a time, you would first need to shift the previous row's coordinates into the same row.
You may refer to the pandas documentation for more help: pandas.DataFrame.apply
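Since each row only needs the previous row's coordinates, a fully vectorized approach with shift is usually much faster than either the loop or apply. A minimal sketch, assuming a haversine great-circle distance in kilometres (the original calculate_distance is not shown, so the formula here is an assumption):

```python
import numpy as np
import pandas as pd

# sample coordinates; replace with your own DataFrame
df = pd.DataFrame({'latitude': [52.0, 52.1, 52.2],
                   'longitude': [4.0, 4.1, 4.2]})

# previous row's coordinates, aligned onto the current row via shift()
lat1 = np.radians(df['latitude'].shift())
lon1 = np.radians(df['longitude'].shift())
lat2 = np.radians(df['latitude'])
lon2 = np.radians(df['longitude'])

# haversine formula (assumed stand-in for calculate_distance), Earth radius 6371 km
a = (np.sin((lat2 - lat1) / 2) ** 2
     + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
df['distance'] = 2 * 6371 * np.arcsin(np.sqrt(a))  # first row is NaN (no previous point)
```

This computes every distance in a handful of NumPy array operations instead of one Python call per row.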

Related

Why does Pandas not come with an option to use .apply in place?

I like to think every design decision is made for a reason. A lot of pandas functions (e.g. df.drop, df.rename, df.replace) come with a parameter, inplace. If you set it to True, instead of returning a new dataframe, pandas modifies the dataframe, well, in place. No surprises here ;).
However, I often find myself using df.apply in combination with a lambda expression to do somewhat more complex operations on columns. Consider the following example:
Say I have text data that needs to be pre-processed for sentiment analysis. I would use:
import string

def remove_punctuation(text):
    no_punct = "".join([c for c in text if c not in string.punctuation])
    return no_punct
And then adapt my column as follows:
df['text'] = df['text'].apply(lambda x: remove_punctuation(x))
I recently noticed that .apply does not have an argument inplace=True. Since this function is mostly used to update dataframes, why is such an argument not available? What would be a rationale behind this?
pandas.DataFrame.apply and pandas.Series.apply both return a new Series or DataFrame rather than modifying the original. In your example you apply the function to a Series, and inplace might make sense there. However, there are other applications where it wouldn't.
For example, with df being:
   col1  col2
0     1     3
1     2     4
Doing:
s = df.apply(lambda x: x.col1 + x.col2, axis=1)
would return a Series, which has a different type and shape than the original DataFrame.
In this case an inplace argument wouldn't make much sense.
I think the pandas devs wanted to enforce consistency between pandas.DataFrame.apply and pandas.Series.apply, avoiding the confusion that would come from having an inplace argument in pandas.Series.apply only.
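Making that shape mismatch concrete (a minimal runnable sketch):

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})

# row-wise apply collapses the (2, 2) DataFrame into a (2,) Series
s = df.apply(lambda x: x.col1 + x.col2, axis=1)

print(type(s).__name__, s.shape)  # Series (2,)
print(s.tolist())                 # [4, 6]
# a (2, 2) DataFrame cannot be replaced "in place" by a (2,) Series
```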

How to change the sequence of some variables in a Pandas dataframe?

I have a dataframe that has 100 variables, var1--var100. I want to bring var40, var20, and var30 to the front, with the other variables remaining in their original order. I've searched online; methods like
1: df[[var40, var20, var30, var1....]]
2: columns= [var40, var20, var30, var1...]
all require specifying all the variables in the dataframe. With 100 variables in my dataframe, how can I do this efficiently?
I am a SAS user; in SAS, we can use a retain statement before the set statement to achieve this. Is there an equivalent way in Python?
Thanks
Consider reindex with a conditional list comprehension:
first_cols = ['var40', 'var20', 'var30']
df = df.reindex(first_cols + [col for col in df.columns if col not in first_cols],
                axis='columns')
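A small runnable version of the same idea, with var1..var6 standing in for the 100 columns (names shortened for illustration):

```python
import pandas as pd

# six columns stand in for var1..var100
df = pd.DataFrame([[0] * 6], columns=[f'var{i}' for i in range(1, 7)])

first_cols = ['var4', 'var2', 'var3']
df = df.reindex(columns=first_cols + [c for c in df.columns if c not in first_cols])

print(list(df.columns))  # ['var4', 'var2', 'var3', 'var1', 'var5', 'var6']
```

Only the columns to move are named; the list comprehension keeps the rest in their original order.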

Vectorized Flag Assignment in Dataframe

I have a dataframe with observations possessing a number of codes. I want to compare the codes present in a row with a list. If any codes are in that list, I wish to flag the row. I can accomplish this using the itertuples method as follows:
import pandas as pd
import numpy as np
df = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                   'cd1': ['abc1', 'abc2', 'abc3', 'abc4', 'abc5'],
                   'cd2': ['abc3', 'abc4', 'abc5', 'abc6', ''],
                   'cd3': ['abc10', '', '', '', '']})
code_flags = ['abc1','abc6']
# initialize flag column
df['flag'] = 0
# itertuples method
for row in df.itertuples():
    if any(df.iloc[row.Index, 1:4].isin(code_flags)):
        df.at[row.Index, 'flag'] = 1
The output correctly adds a flag column with the appropriate flags, where 1 indicates a flagged entry.
However, on my actual use case, this takes hours to complete. I have attempted to vectorize this approach using numpy.where.
df['flag'] = 0 # reset
df['flag'] = np.where(any(df.iloc[:,1:4].isin(code_flags)),1,0)
This evaluates to the same value for every row. I think I'm confused about how the vectorization treats the index; I can even remove the colon and write df.iloc[1:4] and obtain the same result.
Am I misunderstanding the where function? Is my indexing incorrect, causing a True evaluation for all cases? Is there a better way to do this?
Use np.where with the DataFrame method .any(axis=1), not the Python built-in any(...):
df['flag'] = np.where(df.iloc[:, 1:4].isin(code_flags).any(axis=1), 1, 0)
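The difference matters because iterating a DataFrame, which the built-in any(...) does, yields its column labels; those are all truthy non-empty strings, so the condition collapses to a single scalar True for every row. The .any(axis=1) method instead reduces the boolean frame row-wise. A runnable sketch on the example frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                   'cd1': ['abc1', 'abc2', 'abc3', 'abc4', 'abc5'],
                   'cd2': ['abc3', 'abc4', 'abc5', 'abc6', ''],
                   'cd3': ['abc10', '', '', '', '']})
code_flags = ['abc1', 'abc6']

mask = df.iloc[:, 1:4].isin(code_flags)        # boolean DataFrame, one cell per code column
print(any(mask))                               # True -- built-in any() sees truthy column labels
df['flag'] = np.where(mask.any(axis=1), 1, 0)  # row-wise reduction over cd1..cd3
print(df['flag'].tolist())                     # [1, 0, 0, 1, 0]
```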

Pandas: apply a function with columns and a variable as argument

I'm trying to apply to a dataframe a function that has more than one argument, of which two need to come from the dataframe's rows, and one is a variable (a simple number).
A variation on a similar thread works for the rows (all functions are oversimplified compared to my original ones):
import pandas as pd
d = {'a': [-2, 5, 4, -6], 'b': [4, 4, 5, -8]}
df = pd.DataFrame(d)
print(df)

def DummyFunction(row):
    return row['a'] * row['b']

# this works:
df['Dummy1'] = df.apply(DummyFunction, axis=1)
But how can I apply the following variation, where my function takes in an additional argument (a fixed variable)? I seem to find no way to pass it inside the apply method:
def DummyFunction2(row, threshold):
    return row['a'] * row['b'] * threshold

# where threshold will be assigned to a number?
# I don't seem to find a viable option to fill the row argument below:
# df['Dummy2'] = df.apply(DummyFunction2(row, 1000), axis=1)
Thanks for your help!
You can pass the additional variable as a keyword argument to pd.DataFrame.apply; any extra keyword arguments are forwarded to the applied function:
def DummyFunction2(row, threshold):
    return row['a'] * row['b'] * threshold

df['Dummy2'] = df.apply(DummyFunction2, threshold=2, axis=1)
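The same extra argument can also be passed positionally via apply's args= parameter, or with a lambda wrapper; a small runnable sketch using the example frame above:

```python
import pandas as pd

df = pd.DataFrame({'a': [-2, 5, 4, -6], 'b': [4, 4, 5, -8]})

def DummyFunction2(row, threshold):
    return row['a'] * row['b'] * threshold

# positional extra arguments: args= must be a tuple
df['Dummy2'] = df.apply(DummyFunction2, args=(2,), axis=1)

# equivalent lambda wrapper
df['Dummy3'] = df.apply(lambda row: DummyFunction2(row, 2), axis=1)

print(df['Dummy2'].tolist())  # [-16, 40, 40, 96]
```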

Pandas: Use iterrows on Dataframe subset

What is the best way to do iterrows with a subset of a DataFrame?
Let's take the following simple example:
import datetime as DT
import pandas as pd

df = pd.DataFrame({
    'Product': list('AAAABBAA'),
    'Quantity': [5, 2, 5, 10, 1, 5, 2, 3],
    'Start': [
        DT.datetime(2013, 1, 1, 9, 0),
        DT.datetime(2013, 1, 1, 8, 5),
        DT.datetime(2013, 2, 5, 14, 0),
        DT.datetime(2013, 2, 5, 16, 0),
        DT.datetime(2013, 2, 8, 20, 0),
        DT.datetime(2013, 2, 8, 16, 50),
        DT.datetime(2013, 2, 8, 7, 0),
        DT.datetime(2013, 7, 4, 8, 0)]})
df = df.set_index(['Start'])
Now I would like to modify a subset of this DataFrame using the iterrows function, e.g.:
for i, row_i in df[df.Product == 'A'].iterrows():
    row_i['Product'] = 'A1'  # actually a more complex calculation
However, the changes do not persist.
Is there any possibility (except a manual lookup using the index 'i') to make persistent changes to the original DataFrame?
Why do you need iterrows() for this? It's generally preferable to use vectorized operations in pandas (or NumPy). iterrows yields a copy of each row, so writes to row_i never reach the original DataFrame; assign through .loc instead (the older .ix accessor has since been removed from pandas):
df.loc[df['Product'] == 'A', 'Product'] = 'A1'
I guess the best way that comes to my mind is to generate a new vector with the desired result, where you can loop all you want, and then reassign it back to the column:
# make a copy of the column
P = df.Product.copy()
# do the operation, or loop if you really must
P[P == "A"] = "A1"
# reassign to the original df
df["Product"] = P
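Putting the vectorized answer together with the example frame (a runnable sketch; the timestamps are collapsed to one placeholder value for brevity):

```python
import datetime as DT
import pandas as pd

df = pd.DataFrame({
    'Product': list('AAAABBAA'),
    'Quantity': [5, 2, 5, 10, 1, 5, 2, 3],
    'Start': [DT.datetime(2013, 1, 1, 9, 0)] * 8})  # placeholder timestamps
df = df.set_index(['Start'])

# vectorized, persistent assignment on the subset
df.loc[df['Product'] == 'A', 'Product'] = 'A1'
print(df['Product'].tolist())  # ['A1', 'A1', 'A1', 'A1', 'B', 'B', 'A1', 'A1']
```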
