Python: Lambda function with multiple conditions based on multiple previous rows - python

I am trying to define a lambda function that assigns True or False to a row based on various conditions.
There is a column with a Timestamp and what I want is, that if within the last 10 seconds (based on the timestamp of the current row x) some specific values occured in other columns of the dataset, the current row x gets the True or False tag.
So basically I have to check whether in the previous n rows, i.e. Timestamp(x) - 10 seconds value a occured in column A and value b occured in column B.
I already looked at the shift() function with freq = 10 seconds and another attempt looked like that:
data['Timestamp'][(data['Timestamp']-pd.Timedelta(seconds=10)):data['Timestamp']]
But I wasn't able to proceed with either of the two options.
Is it possible to start an additional select within a lambda function? If yes, how could that look like?
P.S.: Working with regular for-loops instead of the lambda function is not an option due to the overall setup of the application/code.
Thanks for your help and input!

Perhaps you're looking for something like this, if I understood correctly:
def create_tag(current_timestamp, df, cols_vals):
# Before the current timestamp
mask = (df['Timestamp'] <= current_timestamp)
# After the current timestamp - 10s
mask = mask & (df['Timestamp'] >= current_timestamp - pd.to_timedelta('10s'))
# Filter all dataframe following the mask
filtered = df[mask]
# Check if each val of col is present
present = all(value in filtered[column_name].values for column_name, value in cols_vals.items())
return present
data['Tag'] = data['Timestamp'].apply(lambda x: create_tag(x, data, {'column A': 'a', 'column B', 'b'}))
The idea behind this code is, for each timestamp that you have, we're going to apply the create_tag function. This takes the current timestamp, the whole dataframe as well as a dictionary containing column names as keys and the respective values you're looking for as values.

Related

Replacing large dataset Multiple Conditions Loop with faster alternative in Pandas Dataframe

I'm trying to perform a nested loop onto a Dataframe but I'm encountering serious speed issues. Essentially, I have a list of unique values through which I want to loop through, all of which will need to be iterated on four different columns. The code is shown below:
def get_avg_val(temp_df, col):
temp_df = temp_df.replace(0, np.NaN)
avg_val = temp_df[col].mean()
return (0 if math.isnan(avg_val) else avg_val)
Final_df = pd.DataFrame(rows_list, columns=col_names)
""" Inserts extra column to identify Securities by Group type - then identifies list of unique values"""
Final_df["Group_SecCode"] = Final_df['Group'].map(str)+ "_" + Final_df['ISIN'].map(str)
unique_list = Final_df.Group_SecCode.unique().tolist()
""" The below allows for replacing missing values with averages """
col_list = ['Option Adjusted Spread','Effective Duration','Spread Duration','Effective Convexity']
for unique_val in unique_list:
temp_df = Final_df[Final_df['Group_SecCode'] == unique_val]
for col in col_list:
amended_val = get_avg_val (temp_df, col)
""" The below identifies columns where Unique code is and there is an NaN - via mask; afterwards np.where replaces the value in the cell with the amended value"""
mask = (Final_df['Group_SecCode'] == unique_val) & (Final_df[col].isnull())
Final_df[col] = np.where(mask, amended_val, Final_df[col])
The 'Mask' section specifies when two conditions are fulfilled in the dataframe and the np.where replaces the values in the cells identified with Amendend Value (which is itself a Function performing an average value).
Now this would normally work but with over 400k rows and a dozen of columns, speed is really slow. Is there any recommended way to improve on the two 'For..'? As I believe these are the reason for which the code takes some time.
Thanks all!
I am not certain if this is what you are looking for, but if your goal is to impute missing values of a series corresponding to the average value of that series in a particular group you can do this as follow:
for col in col_list:
Final_df[col] = Final_df.groupby('Group_SecCode')[col].transform(lambda x:
x.fillna(x.mean()))
UPDATE - Found an alternative way to Perform the amendments via Dictionary, with the task now taking 1.5 min rather than 35 min.
Code below. The different approach here allows for filtering the DataFrame into smaller ones, on which a series of operations are carried out. The new data is then stored into a Dictionary this time, with a loop adding more data onto it. Finally the dictionary is transferred back to the initial DataFrame, replacing it entirely with the updated dataset.
""" Creates Dataframe compatible with Factset Upload and using rows previously stored in rows_list"""
col_names = ['Group','Date','ISIN','Name','Currency','Price','Proxy Duration','Option Adjusted Spread','Effective Duration','Spread Duration','Effective Convexity']
Final_df = pd.DataFrame(rows_list, columns=col_names)
""" Inserts extra column to identify Securities by Group type - then identifies list of unique values"""
Final_df["Group_SecCode"] = Final_df['Group'].map(str)+ "_" + Final_df['ISIN'].map(str)
unique_list = Final_df.Group_SecCode.unique().tolist()
""" The below allows for replacing missing values with averages """
col_list = ['Option Adjusted Spread','Effective Duration','Spread Duration','Effective Convexity']
""" Sets up Dictionary where to store Unique Values Dataframes"""
final_dict = {}
for unique_val in unique_list:
condition = Final_df['Group_SecCode'].isin([unique_val])
temp_df = Final_df[condition].replace(0, np.NaN)
for col in col_list:
""" Perform Amendments at Filtered Dataframe - by column """
""" 1. Replace NaN values with Median for the Datapoints encountered """
#amended_val = get_avg_val (temp_df, col) #Function previously used to compute average
#mask = (Final_df['Group_SecCode'] == unique_val) & (Final_df[col].isnull())
#Final_df[col] = np.where(mask, amended_val, Final_df[col])
amended_val = 0 if math.isnan(temp_df[col].median()) else temp_df[col].median()
mask = temp_df[col].isnull()
temp_df[col] = np.where(mask, amended_val, temp_df[col])
""" 2. Perform Validation Checks via Function defined on line 36 """
temp_df = val_checks (temp_df,col)
""" Updates Dictionary with updated data at Unique Value level """
final_dict.update(temp_df.to_dict('index')) #Updates Dictionary with Unique value Dataframe
""" Replaces entirety of Final Dataframe including amended data """
Final_df = pd.DataFrame.from_dict(final_dict, orient='index', columns=col_names)

Panda groupby shifting and count at same time

Basically I am trying the take the previous row for the combination of ['dealer','State','city']. If I have multiple values in this combination I will get the Shifted value of this combination.
df['ShiftBY_D_S_C']= df.groupby(['dealer','State','city'])['dealer'].shift(1)
I am taking this ShiftBY_D_S_C column again and trying to take the count for the ['ShiftBY_D_S_C','State','city'] combination.
df['NewColumn'] = (df.groupby(['ShiftBY_D_S_C','State','city'])['ShiftBY_D_S_C'].transform("count"))+1
Below table shows what I am trying to do and it works well also. But when all the rows in ShiftBY_D_S_C column is nulls, this not working, as it have all null values. Any suggestions?
I am trying to see the NewColumn values like below when all the values in ShiftBY_D_S_C are NaN.
You could simply handle the special case that you describe with an if/else case:
if df['ShiftBY_D_S_C'].isna().all():
df['NewColumn'] = 1
else:
df['NewColumn'] = df.groupby(...)

Get nth row after applying lambda on groupby in python

So I need to group a dataframe by its SessionId, and then I need to sort each group with the created time, afterwards i need to retrieve the nth row only of each group.
but i found that after applying lambda it becomes a dataframe instead of a group by object, hence i cannot use the .nth property
grouped = df.groupby(['SessionId'])
sorted = grouped.apply(lambda x: x.sort_values(["Created"], ascending = True))
sorted.nth ---> error
Changing the order in which you are approaching the problem in this case will help. If you first sort and then use groupby, you will get the desired output and you can use the groupby.nth function.
Here is a code snippet to demonstrate the idea:
df = pd.DataFrame({'id':['a','a','a','b','b','b'],
'var1':[3,2,1,8,7,6],
'var2':['g','h','i','j','k','l']})
n = 2 # replace with required row from each group
df.sort_values(['id','var1']).groupby('id').nth(n).reset_index()
Assuming id is your sessionid and var1 is the timestamp, this sorts your dataframe by id and then var1. Then picks up the nth row from each of these sorted groups. The reset_index() is there just to avoid the resulting multi-index.
If you want to get the last n rows of each group, you can use .tail(n) instead of .nth(n).
I have created a small dataset -
n = 2
grouped = df.groupby('SessionId')
pd.concat([grouped.get_group(x).sort_values(by='SortVar').reset_index().loc[[n]] for x in grouped.groups]\
,axis=0)
This will return -
Please note that in python index start from zero, so for n=2, it will give you 3rd row in sorted data

Slice a dataframe based on one column starting with the value of another column

I have a dataframe called data, that looks like like this:
|...|category|...|ngram|...|
I need to slice this dataframe to instances where category starts with the value of ngram. So for example, if I had the following instance:
category: beds
ngram: bed
then that instance should be dropped from the resulting dataframe.
In T-SQL, I use the following query (which may not be the best way, but it works):
SELECT
*
FROM mytable
WHERE category NOT LIKE ngram+'%';
I have read up on this a bit, and my best attempt is:
data[data.category.str.startswith(data.ngram.str) == True]
But this does not return any rows, nor does the inverse (using == True)
#use df.apply to filter the rows with category starts with ngram.
data[data.apply(lambda x: x.category.startswith(x.ngram), axis=1)]

Pandas For Loop, If String Is Present In ColumnA Then ColumnB Value = X

I'm pulling Json data from the Binance REST API, after formatting I'm left with the following...
I have a dataframe called Assets with 3 columns [Asset,Amount,Location],
['Asset'] holds ticker names for crypto assets e.g.(ETH,LTC,BNB).
However when all or part of that asset has been moved to 'Binance Earn' the strings are returned like this e.g.(LDETH,LDLTC,LDBNB).
['Amount'] can be ignored for now.
['Location'] is initially empty.
I'm trying to set the value of ['Location'] to 'Earn' if the string in ['Asset'] includes 'LD'.
This is how far I got, but I can't remember how to apply the change to only the current item, it's been ages since I've used Pandas or for loops.
And I'm only able to apply it to the entire column rather than the row iteration.
for Row in Assets['Asset']:
if Row.find('LD') == 0:
print('Earn')
Assets['Location'] = 'Earn' # <----How to apply this to current row only
else:
print('???')
Assets['Location'] = '???' # <----How to apply this to current row only
The print statements work correctly, but currently the whole column gets populated with the same value (whichever was last) as you might expect.
So (LDETH,HOT,LDBTC) returns ('Earn','Earn','Earn') rather than the desired ('Earn','???','Earn')
Any help would be appreciated...
np.where() fits here. If the Asset starts with LD, then return Earn, else return ???:
Assets['Location'] = np.where(Assets['Asset'].str.startswith('LD'), 'Earn', '???')
You could run a lambda in df.apply to check whether 'LD' is in df['Asset']:
df['Location'] = df['Asset'].apply(lambda x: 'Earn' if 'LD' in x else None)
One possible solution:
def get_loc(row):
asset = row['Asset']
if asset.find('LD') == 0:
print('Earn')
return 'Earn'
print('???')
return '???'
Assets['Location'] = Assets.apply(get_loc, axis=1)
Note, you should almost never iterate over a pandas dataframe or series.

Categories

Resources