Replacing large dataset Multiple Conditions Loop with faster alternative in Pandas Dataframe - python

I'm trying to perform a nested loop over a DataFrame but I'm encountering serious speed issues. Essentially, I have a list of unique values to loop through, and for each of them I need to iterate over four different columns. The code is shown below:
import math
import numpy as np
import pandas as pd

def get_avg_val(temp_df, col):
    temp_df = temp_df.replace(0, np.NaN)
    avg_val = temp_df[col].mean()
    return 0 if math.isnan(avg_val) else avg_val

Final_df = pd.DataFrame(rows_list, columns=col_names)

""" Inserts extra column to identify Securities by Group type - then identifies list of unique values"""
Final_df["Group_SecCode"] = Final_df['Group'].map(str) + "_" + Final_df['ISIN'].map(str)
unique_list = Final_df.Group_SecCode.unique().tolist()

""" The below allows for replacing missing values with averages """
col_list = ['Option Adjusted Spread', 'Effective Duration', 'Spread Duration', 'Effective Convexity']

for unique_val in unique_list:
    temp_df = Final_df[Final_df['Group_SecCode'] == unique_val]
    for col in col_list:
        amended_val = get_avg_val(temp_df, col)
        """ The below identifies rows where the unique code matches and the column is NaN - via mask; afterwards np.where replaces the value in the cell with the amended value """
        mask = (Final_df['Group_SecCode'] == unique_val) & (Final_df[col].isnull())
        Final_df[col] = np.where(mask, amended_val, Final_df[col])
The mask line selects rows where both conditions are fulfilled, and np.where then replaces the values in those cells with amended_val (which is computed by a function that returns the column average).
Now this works, but with over 400k rows and a dozen columns, it is really slow. Is there any recommended way to improve on the two for loops? I believe they are the reason the code takes so long.
Thanks all!

I am not certain if this is what you are looking for, but if your goal is to impute missing values of a series with the average value of that series in a particular group, you can do it as follows:
for col in col_list:
    Final_df[col] = Final_df.groupby('Group_SecCode')[col].transform(lambda x: x.fillna(x.mean()))
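For what it's worth, the same imputation can usually be expressed without a Python-level lambda, which tends to be faster on large frames (a sketch, assuming the same Final_df and col_list as above):

for col in col_list:
    # group means broadcast back to the original row order
    group_mean = Final_df.groupby('Group_SecCode')[col].transform('mean')
    Final_df[col] = Final_df[col].fillna(group_mean)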

UPDATE - I found an alternative way to perform the amendments via a dictionary, with the task now taking 1.5 min rather than 35 min.
Code below. The different approach here filters the DataFrame into smaller ones, on which a series of operations is carried out. The amended data is stored in a dictionary this time, with the loop adding more data to it on each pass. Finally, the dictionary is transferred back into the initial DataFrame, replacing it entirely with the updated dataset.
""" Creates Dataframe compatible with Factset Upload and using rows previously stored in rows_list"""
col_names = ['Group','Date','ISIN','Name','Currency','Price','Proxy Duration','Option Adjusted Spread','Effective Duration','Spread Duration','Effective Convexity']
Final_df = pd.DataFrame(rows_list, columns=col_names)
""" Inserts extra column to identify Securities by Group type - then identifies list of unique values"""
Final_df["Group_SecCode"] = Final_df['Group'].map(str)+ "_" + Final_df['ISIN'].map(str)
unique_list = Final_df.Group_SecCode.unique().tolist()
""" The below allows for replacing missing values with averages """
col_list = ['Option Adjusted Spread','Effective Duration','Spread Duration','Effective Convexity']
""" Sets up Dictionary where to store Unique Values Dataframes"""
final_dict = {}
for unique_val in unique_list:
condition = Final_df['Group_SecCode'].isin([unique_val])
temp_df = Final_df[condition].replace(0, np.NaN)
for col in col_list:
""" Perform Amendments at Filtered Dataframe - by column """
""" 1. Replace NaN values with Median for the Datapoints encountered """
#amended_val = get_avg_val (temp_df, col) #Function previously used to compute average
#mask = (Final_df['Group_SecCode'] == unique_val) & (Final_df[col].isnull())
#Final_df[col] = np.where(mask, amended_val, Final_df[col])
amended_val = 0 if math.isnan(temp_df[col].median()) else temp_df[col].median()
mask = temp_df[col].isnull()
temp_df[col] = np.where(mask, amended_val, temp_df[col])
""" 2. Perform Validation Checks via Function defined on line 36 """
temp_df = val_checks (temp_df,col)
""" Updates Dictionary with updated data at Unique Value level """
final_dict.update(temp_df.to_dict('index')) #Updates Dictionary with Unique value Dataframe
""" Replaces entirety of Final Dataframe including amended data """
Final_df = pd.DataFrame.from_dict(final_dict, orient='index', columns=col_names)
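For comparison, here is a sketch of how the same fill logic (zeros treated as missing, per-group median, falling back to 0 when a group has no data) could be expressed with groupby/transform, skipping the per-group loop and the dictionary round-trip; the val_checks step is omitted since its definition is not shown:

for col in col_list:
    cleaned = Final_df[col].replace(0, np.nan)                       # mirror the replace(0, NaN) done on temp_df
    group_median = cleaned.groupby(Final_df['Group_SecCode']).transform('median')
    Final_df[col] = cleaned.fillna(group_median.fillna(0))           # median per group, 0 if the group has no data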

Related

Python: Lambda function with multiple conditions based on multiple previous rows

I am trying to define a lambda function that assigns True or False to a row based on various conditions.
There is a column with a Timestamp, and what I want is that if, within the last 10 seconds (based on the timestamp of the current row x), some specific values occurred in other columns of the dataset, the current row x gets the True or False tag.
So basically I have to check whether, in the previous n rows, i.e. Timestamp(x) - 10 seconds, value a occurred in column A and value b occurred in column B.
I already looked at the shift() function with freq = 10 seconds, and another attempt looked like this:
data['Timestamp'][(data['Timestamp']-pd.Timedelta(seconds=10)):data['Timestamp']]
But I wasn't able to proceed with either of the two options.
Is it possible to start an additional select within a lambda function? If yes, what could that look like?
P.S.: Working with regular for-loops instead of the lambda function is not an option due to the overall setup of the application/code.
Thanks for your help and input!
Perhaps you're looking for something like this, if I understood correctly:
def create_tag(current_timestamp, df, cols_vals):
    # Before the current timestamp
    mask = (df['Timestamp'] <= current_timestamp)
    # After the current timestamp - 10s
    mask = mask & (df['Timestamp'] >= current_timestamp - pd.to_timedelta('10s'))
    # Filter the whole dataframe with the mask
    filtered = df[mask]
    # Check if each value is present in its column
    present = all(value in filtered[column_name].values for column_name, value in cols_vals.items())
    return present

data['Tag'] = data['Timestamp'].apply(lambda x: create_tag(x, data, {'column A': 'a', 'column B': 'b'}))
The idea behind this code is that, for each timestamp you have, we apply the create_tag function. It takes the current timestamp, the whole dataframe, and a dictionary whose keys are column names and whose values are the respective values you're looking for.
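Since apply re-filters the whole frame for every row, this can get slow on large data. A hedged sketch of a vectorized alternative using a time-based rolling window, assuming 'data' can be sorted by 'Timestamp' and using the 'column A'/'column B' names from the example above (window boundaries may differ very slightly from the apply version):

tmp = data.sort_values('Timestamp').set_index('Timestamp')
# 1 where the target value appears, then check whether it appeared at least once in the last 10s
has_a = tmp['column A'].eq('a').astype(int).rolling('10s').max().eq(1)
has_b = tmp['column B'].eq('b').astype(int).rolling('10s').max().eq(1)
tmp['Tag'] = has_a & has_b   # results live on the time-sorted copy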

Updating a dataframe by referencing row values and column values without iterating

I am rusty with Pandas and Dataframes.
I have one dataframe (named data) with two columns (userid, date).
I have a second dataframe, incidence_matrix, where the rows are userids (the same userids in data) and the columns are dates (the same dates in data). This is how I construct incidence_matrix:
columns = pd.date_range(start='2020-01-01', end='2020-11-30', freq='M', closed='right')
index = data['USERID']
incidence_matrix = pd.DataFrame(index=index, columns=columns)
incidence_matrix = incidence_matrix.fillna(0)
I am trying to iterate over each (userid, date) pair in data, and using the values of each userid and date, update that corresponding cell in incidence_matrix to be 1.
In production data could be millions of rows. So I'd prefer to not iterate over the data and use a vectorization approach.
How can (or should) the above be done?
I am running into errors when attempting to reference cells by name. For example, in my attempt below, the first print statement works but the second doesn't recognize a date value as a label:
for index, row in data.iterrows():
    print(row['USERID'], row['POSTDATE'])
    print(incidence_matrix.loc[row['USERID']][row['POSTDATE']])
Thank you in advance.
Warning: the representation you have chosen is going to be pretty sparse in real life (user visits typically follow a Zipf law), leading to quite inefficient use of memory. You'd be better off representing your incidence as a tall and thin DataFrame, for example the output of:
data.groupby(['userid', data['date'].dt.to_period('M')]).count()
With this caveat out of the way:
def add_new_data(data, incidence=None):
    delta_incidence = (
        data
        .groupby(['userid', data['date'].dt.to_period('M')])
        .count()
        .squeeze()
        .unstack('date', fill_value=0)
    )
    if incidence is None:
        return delta_incidence
    return incidence.combine(delta_incidence, np.add, fill_value=0).astype(int)
should do what you want. It re-indexes the previous value of incidence (if any) such that the outcome is a new DataFrame where the axes are the union of incidence and delta_incidence.
Here is a toy example, for testing:
def gen_data(n):
    return pd.DataFrame(
        dict(
            userid=np.random.choice('bob alice john james sophia'.split(), size=n),
            date=[
                (pd.Timestamp('2020-01-01') + v * pd.Timedelta('365 days')).round('s')
                for v in np.random.uniform(size=n)
            ],
        )
    )
# first time (no previous incidence)
data = gen_data(20)
incidence = add_new_data(data)
# new data arrives
data = gen_data(30)
incidence = add_new_data(data, incidence)
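If the original wide 0/1 layout is still needed at the end (cells equal to 1 where a user posted in a month), the counts can be clipped and reindexed; a small sketch, assuming the column names from the toy example:

incidence_01 = (
    incidence.clip(upper=1)                                       # any positive count becomes 1
             .reindex(data['userid'].unique(), fill_value=0)      # make sure every userid appears as a row
)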

Compare consecutive dataframe rows based on columns in Python

I have a dataframe with data about suppliers. If the supplier name and group are the same, the number of units should ideally be the same too. However, sometimes that is not the case. I am writing code that imports the data from SQL into Python and then compares these numbers.
This is for Python 3. Importing the data into Python was easy. I am a Python rookie. To make things easier for myself, I created individual dataframes for each supplier to compare numbers instead of looking at the whole dataframe at once.
from collections import defaultdict

supp = data['Supplier']
supplier = []
for s in supp:
    if s not in supplier:
        supplier.append(s)

su = "Authentic Brands Group LLC"
deal = defaultdict(list)
blist = []
glist = []
columns = ['Supplier', 'ID', 'Units', 'Grp']
df3 = pd.DataFrame(columns=columns)

def add_row(df3, row):
    df3.loc[-1] = row
    df3.index = df3.index + 1
    return df3.sort_index()

for row in data.itertuples():
    for x in supplier:
        s1 = row.Supplier
        if s1 == su:
            if row.Supplier_Group not in glist:
                glist.append(row.Supplier_Group)
            for g in range(len(glist)):
                if glist[g] == row.Supplier_Group:
                    supp = x
                    blist = []
                    blist.append(row.ID)
                    blist.append(row.Units)
                    blist.append(glist[g])
                    add_row(df3, [b1, row.ID, row.Units, glist[g]])
                    break
            break
        break

for i in range(1, len(df3)):
    if df3.Supplier.loc[i] == df3.Supplier.loc[i-1] and df3.Grp.loc[i] == df3.Grp.loc[i-1]:
        print(df3.Supplier, df3.Grp)
This gives me a small subset that looks like this:
Now I want to look at the supplier name and Grp: if they are the same as in other rows of the dataframe, Units should be the same. In this case, row 2 is incorrect; Units should be 100. I want to add another column to this dataframe that says 'False' if the number of Units is correct. This is the tricky part for me. I can iterate over the rows, but I'm unsure how to compare them and add the column.
I'm stuck at this point.
Any help is highly appreciated. Thank you.
If you have all of your data in a single dataframe, df, you can do the following:
grp_by_cols = ['Supplier', 'ID', 'Grp']
all_cols = grp_by_cols + ['Unit']

res_df = (df.assign(first_unit=lambda df: df.loc[:, all_cols]
                                            .groupby(grp_by_cols)['Unit']
                                            .transform('first'))
            .assign(incorrect=lambda df: df['Unit'] != df['first_unit'])
            .assign(incorrect=lambda df: df.loc[:, grp_by_cols + ['incorrect']]
                                           .groupby(grp_by_cols)['incorrect']
                                           .transform(np.any)))
The first call to assign adds a single new column (called 'first_unit') that is the first value of "Unit" for each group of Supplier/ID/Grp (see grp_by_cols).
The second call to assign adds a column called 'incorrect' that is True when 'Unit' doesn't equal 'first_unit'. The third and final assign call overwrites that column to be True if any rows in that group are True. You can remove that if that's not what you want.
Then, if you want to look at the results for a single supplier, you can do something like:
res_df.query('Supplier == "Authentic Brands Group"')
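As a shorter alternative, if the goal is simply to flag groups whose unit values disagree, a one-line sketch (using the 'Units' spelling from the question, and a hypothetical 'mismatch' column name):

# True for every row of a Supplier/Grp group whose Units values are not all identical
df['mismatch'] = df.groupby(['Supplier', 'Grp'])['Units'].transform('nunique').gt(1)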

How to multi-thread large number of pandas dataframe selection calls on large dataset

df is a dataframe containing 12 million+ rows, unsorted.
Each row has a GROUP ID.
The end goal is to randomly select 1 row per unique GROUP ID, populating a new column named SELECTED where 1 means selected and 0 means not selected.
There may be 5000+ unique GROUP IDs.
I'm seeking a better and faster solution than the following; potentially a multi-threaded solution?
for sec in df['GROUP'].unique():
    sz = df.loc[df.GROUP == sec, ['SELECTED']].size
    sel = [0] * sz
    sel[random.randint(0, sz - 1)] = 1
    df.loc[df.GROUP == sec, ['SELECTED']] = sel
You could try a vectorized version, which will probably speed things up if you have many classes.
import numpy as np
import pandas as pd

# get fake data
df = pd.DataFrame(np.random.rand(10))
df['GROUP'] = df[0].astype(str).str[2]

# mark one element of each group as selected
df['selected'] = df.index.isin(     # Is the current index in the selected list?
    df.groupby('GROUP')             # Get a GroupBy object.
      .apply(pd.Series.sample)      # Select one row from each group.
      .index.levels[1]              # The index is a (group, old_id) pair; take the old_id level.
).astype(int)                       # Convert booleans to ints.
Note that this may fail if duplicate indices are present.
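If duplicate index labels are a concern, here is a sketch of a label-free variant that works with integer positions instead (assuming the same df and GROUP column as above):

rng = np.random.default_rng()
# GroupBy.indices maps each group to the integer positions of its rows
positions = [rng.choice(idx) for idx in df.groupby('GROUP').indices.values()]
df['selected'] = 0
df.iloc[positions, df.columns.get_loc('selected')] = 1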
I do not know pandas DataFrames well, but if you simply set selected where it needs to be 1, and later assume that not having the attribute means not selected, you could avoid updating all elements.
You may also do something like this:
selected = []
for sec in df['GROUP'].unique():
    selected.append(random.choice(sec))
or with a list comprehension:
selected = [random.choice(sec) for sec in df['GROUP'].unique()]
Maybe this can speed it up because you will not need to allocate new memory and update all elements of your dataframe.
If you really want multithreading have a look at concurrent.futures https://docs.python.org/3/library/concurrent.futures.html
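For completeness, a minimal sketch of how the per-group work could be farmed out with concurrent.futures; note that threads give limited benefit for CPU-bound pandas work because of the GIL, so this is illustrative rather than a guaranteed speed-up:

from concurrent.futures import ThreadPoolExecutor
import numpy as np
import pandas as pd

def mark_one(group_df):
    # pick one random row of this group and flag it
    sel = np.zeros(len(group_df), dtype=int)
    sel[np.random.randint(len(group_df))] = 1
    return group_df.assign(SELECTED=sel)

with ThreadPoolExecutor() as ex:
    parts = list(ex.map(mark_one, (g for _, g in df.groupby('GROUP'))))
df = pd.concat(parts)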

Append pandas dataframe with column of averaged values from string matches in another dataframe

Right now I have two dataframes (let's call them A and B) built from Excel imports. Both have different dimensions as well as some empty/NaN cells. Let's say A is data for individual model numbers and B is a set of order information. For every row (unique item) in A, I want to search B for the (possibly) multiple orders for that item number, average the corresponding prices, and append A with a column containing the average price for each item.
The item numbers are alphanumeric, so they have to be strings. Not every item will have orders/pricing information, and I'll be removing those at the next step. This is a large amount of data, so efficiency matters and iterrows probably isn't the right choice. Thank you in advance!
Here's what I have so far:
avgPrice = []
for index, row in dfA.iterrows():
    def avg_unit_price(item_no, unit_price):
        matchingOrders = []
        for item, price in zip(item_no, unit_price):
            if item == row['itemNumber']:
                matchingOrders.append(price)
        avgPrice.append(np.mean(matchingOrders))
    avg_unit_price(dfB['item_no'], dfB['unit_price'])
dfA['avgPrice'] = avgPrice
In general, avoid loops, as they perform poorly. If you can't vectorise easily, then as a last resort you can try pd.Series.apply. In this case, neither was necessary.
import pandas as pd
# B: pricing data
df_b = pd.DataFrame([['I1', 34.1], ['I2', 541.31], ['I3', 451.3], ['I2', 644.3], ['I3', 453.2]],
                    columns=['item_no', 'unit_price'])
# create avg price dictionary
item_avg_price = df_b.groupby('item_no', as_index=False).mean().set_index('item_no')['unit_price'].to_dict()
# A: product data
df_a = pd.DataFrame([['I1'], ['I2'], ['I3'], ['I4']], columns=['item_no'])
# map price info to product data
df_a['avgPrice'] = df_a['item_no'].map(item_avg_price)
# remove unmapped items
df_a = df_a[pd.notnull(df_a['avgPrice'])]
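The intermediate dictionary isn't strictly required; a small variation on the same idea maps the group means directly:

# map each item_no straight to its mean unit_price, then drop items with no orders
df_a['avgPrice'] = df_a['item_no'].map(df_b.groupby('item_no')['unit_price'].mean())
df_a = df_a.dropna(subset=['avgPrice'])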
