Vectorized Flag Assignment in Dataframe - python

I have a dataframe with observations possessing a number of codes. I want to compare the codes present in a row with a list. If any codes are in that list, I wish to flag the row. I can accomplish this using the itertuples method as follows:
import pandas as pd
import numpy as np

df = pd.DataFrame({'id':  [1, 2, 3, 4, 5],
                   'cd1': ['abc1', 'abc2', 'abc3', 'abc4', 'abc5'],
                   'cd2': ['abc3', 'abc4', 'abc5', 'abc6', ''],
                   'cd3': ['abc10', '', '', '', '']})
code_flags = ['abc1', 'abc6']
# initialize flag column
df['flag'] = 0

# itertuples method
for row in df.itertuples():
    if any(df.iloc[row.Index, 1:4].isin(code_flags)):
        df.at[row.Index, 'flag'] = 1
The output correctly adds a flag column with the appropriate flags, where 1 indicates a flagged entry.
However, in my actual use case this takes hours to complete, so I have attempted to vectorize this approach using numpy.where.
df['flag'] = 0 # reset
df['flag'] = np.where(any(df.iloc[:,1:4].isin(code_flags)),1,0)
This appears to evaluate every row the same. I think I'm confused about how the vectorization treats the index: I can remove the colon and comma and write df.iloc[1:4] and obtain the same result.
Am I misunderstanding the where function? Is my indexing incorrect and causing a True evaluation for all cases? Is there a better way to do this?

Use np.where with .any(axis=1), not the built-in any(..). Python's any() iterates over the DataFrame's column labels, which are all non-empty strings, so the condition is always truthy and every row gets the same result; .any(axis=1) reduces the boolean mask row-wise instead:
np.where(df.iloc[:, 1:4].isin(code_flags).any(axis=1), 1, 0)
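For reference, a minimal end-to-end run of that answer on the sample frame above (selecting the code columns by name rather than by position):

import pandas as pd
import numpy as np

df = pd.DataFrame({'id':  [1, 2, 3, 4, 5],
                   'cd1': ['abc1', 'abc2', 'abc3', 'abc4', 'abc5'],
                   'cd2': ['abc3', 'abc4', 'abc5', 'abc6', ''],
                   'cd3': ['abc10', '', '', '', '']})
code_flags = ['abc1', 'abc6']

# row-wise boolean mask: does any code column contain a flagged code?
mask = df[['cd1', 'cd2', 'cd3']].isin(code_flags).any(axis=1)
df['flag'] = np.where(mask, 1, 0)
print(df['flag'].tolist())  # [1, 0, 0, 1, 0]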

Related

Exclude values in DF column

I have a problem: I want to drop from my DF all rows whose value in a column ends with "99".
I tried to create a list :
filteredvalues = [x for x in df['XX'] if x.endswith('99')]
This list contains all the matching values, but how do I apply it to my DF to drop those rows? I tried a few things but nothing works. Most recently I tried this:
df = df[df['XX'] not in filteredvalues]
Any help on this?
Use the .str attribute, with corresponding string methods, to select such items. Then use ~ to negate the result, and filter your dataframe with that:
df = df[~df['XX'].str.endswith('99')]
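One caveat: if the column can contain missing values, .str.endswith returns NaN for them and the boolean filter will raise. Passing na=False treats missing values as non-matches; a minimal sketch with made-up data:

import pandas as pd

df = pd.DataFrame({'XX': ['ab99', 'cd12', None, 'ef99']})

# keep only rows whose XX does not end with '99';
# na=False makes missing values count as "does not match"
df = df[~df['XX'].str.endswith('99', na=False)]
print(df)  # the 'cd12' and None rows remain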

Python Dataframe: how do I perform vlookup equivalent

This is the df (which is a subset of a much larger df):
df = pd.DataFrame({
    'Date': ['04/03/2020', '06/04/2020', '08/06/2020', '12/12/2020'],
    'Tval': [0.01, 0.015, -0.023, -0.0005]
})
If I need the Tval for, say, '06/04/2020' (just a single date I need the value for), how do I get it? I know merge and join can be used to replicate a vlookup in Python, but what if it's a single value you're looking for? What's the best way to perform the task?
The pandas docs recommend using loc or iloc for most lookups:
df = pd.DataFrame({
    'Date': ['04/03/2020', '06/04/2020', '08/06/2020', '12/12/2020'],
    'Tval': [0.01, 0.015, -0.023, -0.0005]
})

df.loc[df.Date == '06/04/2020', 'Tval']
Here the first part of the expression in the brackets, df.Date == '06/04/2020', selects the row(s) you want to see, and the second part specifies which column(s) you want displayed.
If instead you wanted to see the data for the entire row, you could re-write it as df.loc[df.Date == '06/04/2020', : ].
Selecting in a dataframe works like this:
df.loc[df.Date == '08/06/2020', 'Tval']
The way to make sense of this is that
df.Date == '08/06/2020'
Out:
0    False
1    False
2     True
3    False
Name: Date, dtype: bool
produces a series of True/False values showing which rows in the Date column match the equality. If you give a DataFrame such a series, it will select out only the rows where the series is True. To see that, look at what you get from:
df.loc[df.Date == '08/06/2020']
Out[]:
         Date   Tval
2  08/06/2020 -0.023
Finally, to see just the values of 'Tval' we do:
df.loc[df.Date == '08/06/2020', 'Tval']
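If you need the result as a plain scalar rather than a one-element Series, one option (assuming the date is known to appear exactly once) is to squeeze the selection down:

# .loc returns a Series; .squeeze() collapses a one-element Series to a scalar
tval = df.loc[df.Date == '06/04/2020', 'Tval'].squeeze()
print(tval)  # 0.015

.item() or .iloc[0] on the selected Series work similarly; all of these misbehave or raise if the date matches zero or multiple rows, so the single-match assumption matters.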

Dataframe sorting does not apply when using .loc

I need to sort a pandas DataFrame df by a datetime column my_date. Whenever I use .loc, sorting does not apply.
df = df.loc[(df.some_column == 'filter'),]
df.sort_values(by=['my_date'])
print(df)
# ...
# Not sorted!
# ...
df = df.loc[(df.some_column == 'filter'),].sort_values(by=['my_date'])
# ...
# sorting WORKS!
What is the difference of these two uses? What am I missing about dataframes?
In the first case, you didn't perform an operation in-place: you should have used either df = df.sort_values(by=['my_date']) or df.sort_values(by=['my_date'], inplace=True).
In the second case, the result of .sort_values() was saved to df, hence printing df shows the sorted dataframe.
In the first snippet, you call df.loc[...] and then df.sort_values(...) as two separate statements but never assign the sorted result back, so it is discarded.
In the second snippet, you chain the calls, df.loc[...].sort_values(...), and assign the result to df, which is the correct way; you don't have to repeat the df. notation for each step of the chain.
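A minimal demonstration that sort_values returns a new DataFrame rather than modifying the original (illustrative data, not from the question):

import pandas as pd

df = pd.DataFrame({'my_date': pd.to_datetime(['2020-03-01', '2020-01-01']),
                   'some_column': ['filter', 'filter']})

df.sort_values(by=['my_date'])                 # result is built, then thrown away
print(df['my_date'].is_monotonic_increasing)   # False: df unchanged

df = df.sort_values(by=['my_date'])            # assign the result back
print(df['my_date'].is_monotonic_increasing)   # True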

Fastest way to assign value to pandas cell

I have a function called calculate_distance, which takes 4 Pandas cells as input and returns a new value that I want to assign to a specific Pandas cell.
The 4 input values change dynamically as seen in the code below.
df['distance'] = ''
for i in range(1, df.shape[0]):
    df.at[i, 'distance'] = calculate_distance(df['latitude'].iloc[i-1],
                                              df['longitude'].iloc[i-1],
                                              df['latitude'].iloc[i],
                                              df['longitude'].iloc[i])
Is there a faster way to do it than this "newbie" for loop?
You could use
df['distance'] = df.apply(your_calculate_distance_def, axis=1)
It's faster than a Python loop. I don't know what your function does internally, but apply will help you boost speed.
You may refer to the pandas apply documentation for more help: pandas.DataFrame.apply
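Since each row's distance depends on the previous row, apply alone can't see two rows at once; one sketch is to shift the coordinate columns first and then apply row-wise. Here prev_lat and prev_lon are hypothetical helper columns, and calculate_distance is a stand-in for the question's function:

import pandas as pd

def calculate_distance(lat1, lon1, lat2, lon2):
    # stand-in: plain Euclidean distance
    return ((lat2 - lat1) ** 2 + (lon2 - lon1) ** 2) ** 0.5

df = pd.DataFrame({'latitude': [0.0, 3.0, 3.0], 'longitude': [0.0, 4.0, 8.0]})

# line each row up with the previous row's coordinates
df['prev_lat'] = df['latitude'].shift()
df['prev_lon'] = df['longitude'].shift()

df['distance'] = df.apply(
    lambda r: calculate_distance(r['prev_lat'], r['prev_lon'],
                                 r['latitude'], r['longitude']),
    axis=1)
# the first row has no predecessor, so its distance is NaN

If calculate_distance happens to accept whole Series, you can skip apply entirely and pass the shifted columns in directly, which is faster still.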

Pandas: Use iterrows on Dataframe subset

What is the best way to do iterrows with a subset of a DataFrame?
Let's take the following simple example:
import datetime as DT
import pandas as pd

df = pd.DataFrame({
    'Product': list('AAAABBAA'),
    'Quantity': [5, 2, 5, 10, 1, 5, 2, 3],
    'Start': [
        DT.datetime(2013, 1, 1, 9, 0),
        DT.datetime(2013, 1, 1, 8, 5),
        DT.datetime(2013, 2, 5, 14, 0),
        DT.datetime(2013, 2, 5, 16, 0),
        DT.datetime(2013, 2, 8, 20, 0),
        DT.datetime(2013, 2, 8, 16, 50),
        DT.datetime(2013, 2, 8, 7, 0),
        DT.datetime(2013, 7, 4, 8, 0)]})
df = df.set_index(['Start'])
Now I would like to modify a subset of this DataFrame using the iterrows function, e.g.:
for i, row_i in df[df.Product == 'A'].iterrows():
    row_i['Product'] = 'A1'  # actually a more complex calculation
However, the changes do not persist.
Is there any way (other than a manual lookup using the index i) to make persistent changes to the original DataFrame?
Why do you need iterrows() for this? I think it's always preferable to use vectorized operations in pandas (or numpy). With a boolean mask and .loc (the old .ix indexer is gone from modern pandas) this is:
df.loc[df['Product'] == 'A', 'Product'] = 'A1'
Alternatively, you can generate a new vector with the desired result, loop over it all you want, and then reassign it back to the column:
# make a copy of the column
P = df.Product.copy()
# do the operation, or loop if you really must
P[P == "A"] = "A1"
# reassign to the original df
df["Product"] = P
