I am rusty with Pandas and Dataframes.
I have one dataframe (named data) with two columns (userid, date).
I have a second dataframe, incidence_matrix, where the rows are userids (the same userids in data) and the columns are dates (the same dates in data). This is how I construct incidence_matrix:
columns = pd.date_range(start='2020-01-01', end='2020-11-30', freq='M', closed='right')
index = data['USERID']
incidence_matrix = pd.DataFrame(index=index, columns=columns)
incidence_matrix = incidence_matrix.fillna(0)
I am trying to iterate over each (userid, date) pair in data and, using the values of each pair, set the corresponding cell in incidence_matrix to 1.
In production, data could be millions of rows, so I'd prefer not to iterate over it and use a vectorized approach instead.
How can (or should) the above be done?
I am running into errors when attempting to reference cells by label. For example, in my attempt below, the first print statement works, but the second doesn't recognize a date value as a label:
for index, row in data.iterrows():
    print(row['USERID'], row['POSTDATE'])
    print(incidence_matrix.loc[row['USERID']][row['POSTDATE']])
Thank you in advance.
Warning: the representation you have chosen is going to be pretty sparse in real life (user visits typically follow a Zipf law), leading to quite inefficient memory usage. You'd be better off representing your incidence as a tall and thin DataFrame, for example the output of:
data.groupby(['userid', data['date'].dt.to_period('M')]).count()
With this caveat out of the way:
import numpy as np
import pandas as pd

def add_new_data(data, incidence=None):
    delta_incidence = (
        data
        .groupby(['userid', data['date'].dt.to_period('M')])
        .count()
        .squeeze()
        .unstack('date', fill_value=0)
    )
    if incidence is None:
        return delta_incidence
    return incidence.combine(delta_incidence, np.add, fill_value=0).astype(int)
should do what you want. It re-indexes the previous value of incidence (if any) such that the outcome is a new DataFrame where the axes are the union of incidence and delta_incidence.
Here is a toy example, for testing:
def gen_data(n):
    return pd.DataFrame(
        dict(
            userid=np.random.choice('bob alice john james sophia'.split(), size=n),
            date=[
                (pd.Timestamp('2020-01-01') + v * pd.Timedelta('365 days')).round('s')
                for v in np.random.uniform(size=n)
            ],
        )
    )
# first time (no previous incidence)
data = gen_data(20)
incidence = add_new_data(data)
# new data arrives
data = gen_data(30)
incidence = add_new_data(data, incidence)
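If you really do need the dense 0/1 matrix from the original question, here is a vectorized sketch (assuming the USERID and POSTDATE column names from your loop, POSTDATE being a datetime column, and monthly buckets):
# counts of posts per (user, month), capped at 1 to get a 0/1 incidence matrix
incidence_matrix = (
    pd.crosstab(data['USERID'], data['POSTDATE'].dt.to_period('M'))
    .clip(upper=1)
)
Users or months that never appear in data won't show up as rows/columns; you can reindex afterwards if you need the full grid.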
I have generated a dataframe (called 'sectors') that stores information from my brokerage account (sector/industry, sub sector, company name, current value, cost basis, etc).
I want to avoid hard coding a filter for each sector or sub sector to find specific data. I have achieved this with the following code (I know, not very pythonic, but I am new to coding):
for x in set(sectors_df['Sector']):
    x_filt = sectors_df['Sector'] == x
    #value in sect takes the sum of all current values in a given sector
    value_in_sect = round(sectors_df.loc[x_filt]['Current Value'].sum(), 2)
    #pct in sect is the % of the sector in the over all portfolio (total equals the total value of all sectors)
    pct_in_sect = round((value_in_sect/total)*100, 2)
    print(x, value_in_sect, pct_in_sect)

for sub in set(sectors_df['Sub Sector']):
    sub_filt = sectors_df['Sub Sector'] == sub
    value_of_subs = round(sectors_df.loc[sub_filt]['Current Value'].sum(), 2)
    pct_of_subs = round((value_of_subs/total)*100, 2)
    print(sub, value_of_subs, pct_of_subs)
My print statements produce the majority of the information I want, although I am still working through how to calculate the % of a sub sector within its own sector. Anyway, I would now like to put this information (value_in_sect, pct_in_sect, etc.) into DataFrames of their own. What would be the best, smartest, or most pythonic way to go about this? I am thinking of a dictionary, and then creating a DataFrame from the dictionary, but I'm not sure.
The split-apply-combine process in pandas, specifically aggregation, is the best way to go about this. First I'll explain how this process would work manually, and then I'll show how pandas can do it in one line.
Manual split-apply-combine
Split
First, divide the DataFrame into groups of the same Sector. This involves getting a list of Sectors and figuring out which rows belong to each (just like the first two lines of your code). The code below runs through the DataFrame and builds a dictionary whose keys are Sectors and whose values are lists of indices of the rows in sectors_df that belong to each Sector.
sectors_index = {}
for ix, row in sectors_df.iterrows():
    if row['Sector'] not in sectors_index:
        sectors_index[row['Sector']] = [ix]
    else:
        sectors_index[row['Sector']].append(ix)
Apply
Run the same function, in this case summing of Current Value and calculating its percentage share, on each group. That is, for each sector, grab the corresponding rows from the DataFrame and run the calculations in the next lines of your code. I'll store the results as a dictionary of dictionaries: {'Sector1': {'value_in_sect': 1234.56, 'pct_in_sect': 11.11}, 'Sector2': ... } for reasons that will become obvious later:
sector_total_value = {}
total_value = sectors_df['Current Value'].sum()
for sector, row_indices in sectors_index.items():
    sector_df = sectors_df.loc[row_indices]
    current_value = sector_df['Current Value'].sum()
    sector_total_value[sector] = {'value_in_sect': round(current_value, 2),
                                  'pct_in_sect': round(current_value/total_value * 100, 2)
                                  }
(see footnote 1 for a note on rounding)
Combine
Finally, collect the function results into a new DataFrame, where the index is the Sector. pandas can easily convert this nested dictionary structure into a DataFrame:
sector_total_value_df = pd.DataFrame.from_dict(sector_total_value, orient='index')
split-apply-combine using groupby
pandas makes this process very simple using the groupby method.
Split
The groupby method splits a DataFrame into groups by a column or multiple columns (or even another Series):
grouped_by_sector = sectors_df.groupby('Sector')
grouped_by_sector is similar to the index we built earlier, but the groups can be manipulated much more easily, as we can see in the following steps.
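As a quick sanity check, the group-to-row-labels mapping we built by hand is available directly on the groupby object (the sector name below is made up):
# roughly equivalent to the sectors_index dictionary from the manual version
print(grouped_by_sector.groups)                   # {'Sector1': Index([...]), 'Sector2': ...}
print(grouped_by_sector.get_group('Utilities'))   # all rows of one sector ('Utilities' is a hypothetical name)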
Apply
To calculate the total value in each group, select the column or columns to sum up and use the agg (or aggregate) method with the function you want to apply:
sector_total_value_df = grouped_by_sector['Current Value'].agg(value_in_sect=sum)
Combine
It's already done! The apply step already creates a DataFrame where the index is the Sector (the groupby column) and the value in the value_in_sect column is the result of the sum operation.
I've left out the pct_in_sect part because a) it can be more easily done after the fact:
sector_total_value_df['pct_in_sect'] = round(sector_total_value_df['value_in_sect'] / total_value * 100, 2)
sector_total_value_df['value_in_sect'] = round(sector_total_value_df['value_in_sect'], 2)
and b) it's outside the scope of this answer.
Most of this can be done easily in one line (see footnote 2 for including the percentage, and rounding):
sector_total_value_df = sectors_df.groupby('Sector')['Current Value'].agg(value_in_sect=sum)
For subsectors, there's one additional consideration: grouping should be done by Sector and Sub Sector rather than just Sub Sector, so that, for example, rows from Utilities/Gas and Energy/Gas aren't combined.
subsector_total_value_df = sectors_df.groupby(['Sector', 'Sub Sector'])['Current Value'].agg(value_in_sect=sum)
This produces a DataFrame with a MultiIndex with levels 'Sector' and 'Sub Sector', and a column 'value_in_sect'. For a final piece of magic, the percentage in Sector can be calculated quite easily:
subsector_total_value_df['pct_within_sect'] = round(subsector_total_value_df['value_in_sect'] / sector_total_value_df['value_in_sect'] * 100, 2)
which works because the 'Sector' index level is matched during division.
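If you prefer to make that alignment explicit, the same result can be written with div and its level argument (equivalent under the frames defined above):
subsector_total_value_df['pct_within_sect'] = round(
    subsector_total_value_df['value_in_sect'].div(
        sector_total_value_df['value_in_sect'], level='Sector'
    ) * 100, 2
)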
Footnote 1. This deviates from your code slightly, because I've chosen to calculate the percentage using the unrounded total value, to minimize the error in the percentage. Ideally though, rounding is only done at display time.
Footnote 2. This one-liner generates the desired result, including percentage and rounding:
sector_total_value_df = sectors_df.groupby('Sector')['Current Value'].agg(
    value_in_sect=lambda c: round(sum(c), 2),
    pct_in_sect=lambda c: round(sum(c) / sectors_df['Current Value'].sum() * 100, 2),
)
I have code that works quite extensively with an Excel file (SAP download): data transformation and calculation steps.
I need to loop through all the rows (a couple thousand) a few times. I previously wrote code that added the DataFrame columns separately, so I could do everything in one for loop, which was of course quite quick; however, I had to change the data source, which meant a change in the raw data structure.
In the raw data, the first 3 rows are empty, then comes a title row with the column names, then 2 more empty rows, and the first column is also empty. I decided to wipe these, assign column names, and make them headers (steps below); however, since then, separately adding the column names and later calculating everything in one for statement does not fill data into any of these specific columns.
How could I optimize this code?
I have deleted some calculation steps since they are quite long and make the code even less readable.
#This function adds new column to the dataframe
def NewColdfConverter(*args):
    for i in args:
        dfConverter[i] = ''  #previously used dfConverter[i] = NaN

#This function creates dataframe from excel file
def DataFrameCreator(path, sheetname):
    excelFile = pd.ExcelFile(path)
    global readExcel
    readExcel = pd.read_excel(excelFile, sheet_name=sheetname)
#calling my function to create dataframe
DataFrameCreator(filePath,sheetName)
dfConverter = pd.DataFrame(readExcel)
#dropping NA values from Orders column (right now called Unnamed)
dfConverter.dropna(subset=['Unnamed: 1'], inplace=True)
#dropping rows and deleting other unnecessary columns
dfConverter.drop(dfConverter.head(1).index, inplace=True)
dfConverter.drop(dfConverter.columns[[0,11,12,13,17,22,23,48]], axis = 1,inplace = True)
#renaming columns from 'Unnamed: 1' etc. to proper names
dfConverter = dfConverter.rename(columns={'Unnamed: 1': 'propername1', 'Unnamed: 2': 'propername2'})  # etc.
#calling new column function -> this Day column appears in the 1st for loop
NewColdfConverter("Day")
#example for loop that worked prior, but not working since new dataset and new header/column steps added:
for i in range(len(dfConverter)):
    #Day column-> floor Entry Date -1, if time is less than 5:00:00
    if dfConverter['Time'][i] <= time(hour=5, minute=0, second=0):
        dfConverter['Day'][i] = pd.to_datetime(dfConverter['Entry Date'][i]) - timedelta(days=1)
    else:
        dfConverter['Day'][i] = pd.to_datetime(dfConverter['Entry Date'][i])
The problem is that there are many columns which build on one another, so I cannot compute them in one for loop. For instance, in the example below I need to calculate reqsWoSetUpValue so I can calculate requirementsValue, so I can calculate otherReqsValue, but I'm not able to do this within one for loop by assigning the values to dataframecolumn[i], because the value will just be missing, as if nothing happened.
(dfSorted is the same as dfConverter, just a sorted version of it)
#example code of getting reqsWoSetUpValue
for i in range(len(dfSorted)):
    reqsWoSetUpValue[i] = #calculationsteps...
#inserting column with value
dfSorted.insert(49, 'Reqs wo SetUp', reqsWoSetUpValue)
#getting requirements value with previously calculated Reqs wo SetUp column
for i in range(len(dfSorted)):
    requirementsValue[i] = #calc
dfSorted.insert(50, 'Requirements', requirementsValue)
#Calculating Other Reqs value with previously calculated Requirements column.
for i in range(len(dfSorted)):
    otherReqsValue[i] = #calc
dfSorted.insert(51, 'Other Reqs', otherReqsValue)
Does anyone have a clue why I can no longer do this in one for loop by first adding all the columns with the function, like:
NewColdfConverter('Reqs wo setup','Requirements','Other reqs')
#then in 1 for loop:
for i in range(len(dfSorted)):
    dfSorted['Reqs wo setup'][i] = #calculationsteps
    dfSorted['Requirements'][i] = #calculationsteps
    dfSorted['Other reqs'][i] = #calculationsteps
Thank you
General comment: How to identify bottlenecks
To get started, you should try to identify which parts of the code are slow.
Method 1: time code sections using the time package
Wrap blocks of code in statements like this:
import time
t = time.time()
# do something
print("time elapsed: {:.1f} seconds".format(time.time() - t))
Method 2: use a profiler
E.g. Spyder has a built-in profiler. This allows you to check which operations are most time consuming.
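If you are not using an IDE with a profiler, the standard library's cProfile gives a similar breakdown (a minimal sketch; main() and your_script.py are placeholders for your own entry point):
# alternatively, from the command line: python -m cProfile -s cumtime your_script.py
import cProfile
cProfile.run('main()', sort='cumtime')  # assumes your code is wrapped in a main() function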
Vectorize your operations
Your code will be orders of magnitude faster if you vectorize your operations. It looks like your loops are all avoidable.
For example, rather than calling pd.to_datetime on every row separately, you should call it on the entire column at once
# slow (don't do this):
for i in range(len(dfConverter)):
    dfConverter['Day'][i] = pd.to_datetime(dfConverter['Entry Date'][i])
# fast (do this instead):
dfConverter['Day'] = pd.to_datetime(dfConverter['Entry Date'])
If you want to perform an operation on a subset of rows, you can also do this in a vectorized operation by using loc:
mask = dfConverter['Time'] <= time(hour=5,minute=0,second=0)
dfConverter.loc[mask,'Day'] = pd.to_datetime(dfConverter.loc[mask,'Entry Date']) - timedelta(days=1)
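Putting both branches together, the whole Day column from your first loop can be filled in one go (a sketch, assuming time and timedelta are imported from datetime as in your code):
import numpy as np

entry = pd.to_datetime(dfConverter['Entry Date'])
mask = dfConverter['Time'] <= time(hour=5, minute=0, second=0)
# subtract one day where the time is at or before 05:00, otherwise keep the entry date
dfConverter['Day'] = np.where(mask, entry - timedelta(days=1), entry)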
Not sure this would improve performance, but you could calculate the dependent columns at the same time row by row with DataFrame.iterrows()
for index, data in dfSorted.iterrows():
    dfSorted.loc[index, 'Reqs wo setup'] = #calculationsteps
    dfSorted.loc[index, 'Requirements'] = #calculationsteps
    dfSorted.loc[index, 'Other reqs'] = #calculationsteps
I'm trying to perform a nested loop on a DataFrame, but I'm running into serious speed issues. Essentially, I have a list of unique values that I want to loop through, and for each one the same operation has to be applied to four different columns. The code is shown below:
def get_avg_val(temp_df, col):
    temp_df = temp_df.replace(0, np.NaN)
    avg_val = temp_df[col].mean()
    return (0 if math.isnan(avg_val) else avg_val)
Final_df = pd.DataFrame(rows_list, columns=col_names)
""" Inserts extra column to identify Securities by Group type - then identifies list of unique values"""
Final_df["Group_SecCode"] = Final_df['Group'].map(str)+ "_" + Final_df['ISIN'].map(str)
unique_list = Final_df.Group_SecCode.unique().tolist()
""" The below allows for replacing missing values with averages """
col_list = ['Option Adjusted Spread','Effective Duration','Spread Duration','Effective Convexity']
for unique_val in unique_list:
    temp_df = Final_df[Final_df['Group_SecCode'] == unique_val]
    for col in col_list:
        amended_val = get_avg_val(temp_df, col)
        """ The below identifies rows where the unique code matches and the value is NaN - via mask; afterwards np.where replaces the value in those cells with the amended value"""
        mask = (Final_df['Group_SecCode'] == unique_val) & (Final_df[col].isnull())
        Final_df[col] = np.where(mask, amended_val, Final_df[col])
The mask identifies the cells where both conditions are fulfilled in the dataframe, and np.where replaces the values in those cells with amended_val (which is itself produced by a function that computes an average value).
Now this would normally work, but with over 400k rows and a dozen columns it is really slow. Is there any recommended way to improve on the two for loops? I believe these are the reason the code takes so long.
Thanks all!
I am not certain if this is what you are looking for, but if your goal is to impute missing values of a series with the average value of that series in a particular group, you can do it as follows:
for col in col_list:
    Final_df[col] = Final_df.groupby('Group_SecCode')[col].transform(lambda x: x.fillna(x.mean()))
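If the Python-level lambda turns out to be slow on 400k rows, an equivalent sketch that relies on pandas' built-in grouped mean may be faster:
for col in col_list:
    # per-group mean broadcast back to the original row order, then used only where values are missing
    group_mean = Final_df.groupby('Group_SecCode')[col].transform('mean')
    Final_df[col] = Final_df[col].fillna(group_mean)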
UPDATE - I found an alternative way to perform the amendments via a dictionary, with the task now taking 1.5 min rather than 35 min.
Code below. This approach filters the DataFrame into smaller ones, on which a series of operations is carried out. The new data is then stored in a dictionary, with the loop adding more data to it. Finally, the dictionary is transferred back into the initial DataFrame, replacing it entirely with the updated dataset.
""" Creates Dataframe compatible with Factset Upload and using rows previously stored in rows_list"""
col_names = ['Group','Date','ISIN','Name','Currency','Price','Proxy Duration','Option Adjusted Spread','Effective Duration','Spread Duration','Effective Convexity']
Final_df = pd.DataFrame(rows_list, columns=col_names)
""" Inserts extra column to identify Securities by Group type - then identifies list of unique values"""
Final_df["Group_SecCode"] = Final_df['Group'].map(str)+ "_" + Final_df['ISIN'].map(str)
unique_list = Final_df.Group_SecCode.unique().tolist()
""" The below allows for replacing missing values with averages """
col_list = ['Option Adjusted Spread','Effective Duration','Spread Duration','Effective Convexity']
""" Sets up Dictionary where to store Unique Values Dataframes"""
final_dict = {}
for unique_val in unique_list:
    condition = Final_df['Group_SecCode'].isin([unique_val])
    temp_df = Final_df[condition].replace(0, np.NaN)
    for col in col_list:
        """ Perform Amendments at Filtered Dataframe - by column """
        """ 1. Replace NaN values with Median for the Datapoints encountered """
        #amended_val = get_avg_val (temp_df, col) #Function previously used to compute average
        #mask = (Final_df['Group_SecCode'] == unique_val) & (Final_df[col].isnull())
        #Final_df[col] = np.where(mask, amended_val, Final_df[col])
        amended_val = 0 if math.isnan(temp_df[col].median()) else temp_df[col].median()
        mask = temp_df[col].isnull()
        temp_df[col] = np.where(mask, amended_val, temp_df[col])
        """ 2. Perform Validation Checks via Function defined on line 36 """
        temp_df = val_checks(temp_df, col)
    """ Updates Dictionary with updated data at Unique Value level """
    final_dict.update(temp_df.to_dict('index'))  #Updates Dictionary with Unique value Dataframe
""" Replaces entirety of Final Dataframe including amended data """
Final_df = pd.DataFrame.from_dict(final_dict, orient='index', columns=col_names)
Alright I could use a little help on this one. I've created a function that I can feed two multi index dataframes into, as well as a list of kwargs, and the function will then take values from one dataframe and add them to the other into a new column.
Just to try to make sure that I'm explaining it well enough, the two dataframes are both stock info, where one dataframe is my "universe" or stocks that I'm analyzing, and the other is a dataframe of market and sector ETFs.
So what my function does is take kwargs in the form of:
new_stock_column_name = "existing_sector_column_name"
Here is my actual function:
def map_columns(hist_multi_to, hist_multi_from, **kwargs):
    ''' Map columns from the sector multi index dataframe to a new column
    in the existing universe multi index dataframe.
    **kwargs should be in the format newcolumn="existing_sector_column"
    '''
    df_to = hist_multi_to.copy()
    df_from = hist_multi_from.copy()
    for key, value in kwargs.items():
        df_to[key] = np.nan
        for index, val in df_to.iterrows():
            try:
                df_to.loc[index, key] = df_from.loc[(index[0], val.xl_sect), value]
            except KeyError:
                pass
    return df_to
I believe my function works exactly as I intend, except it takes a ridiculously long time to loop through the data. There has got to be a better way to do this, so any help you could provide would be greatly appreciated.
I apologize in advance: I'm having trouble coming up with two simple example dataframes, but the only real requirement is that the stock dataframe has a column that lists its sector ETF, and that column's value directly corresponds to the level-1 index of the ETF dataframe.
The exception handler is in place simply because the ETFs sometimes do not exist for all the dates of the analysis, in which case I don't mind that the values stay as NaN.
Update:
Here is a revised code snippet that will allow you to run the function to see what I'm talking about. Forgive me, my coding skills are limited.
import pandas as pd
import numpy as np
stock_arrays = [np.array(['1/1/2020','1/1/2020','1/2/2020','1/2/2020']),
np.array(['aapl', 'amzn', 'aapl', 'amzn'])]
stock_tuples = list(zip(*stock_arrays))
stock_index = pd.MultiIndex.from_tuples(stock_tuples, names=['date', 'stock'])
etf_arrays = [np.array(['1/1/2020','1/1/2020','1/2/2020','1/2/2020']),
np.array(['xly', 'xlk','xly', 'xlk'])]
etf_tuples = list(zip(*etf_arrays))
etf_index = pd.MultiIndex.from_tuples(etf_tuples, names=['date', 'stock'])
stock_df = pd.DataFrame(np.random.randn(4), index=stock_index, columns=['close_price'])
etf_df = pd.DataFrame(np.random.randn(4), index=etf_index, columns=['close_price'])
stock_df['xl_sect'] = np.array(['xlk', 'xly','xlk', 'xly'])
def map_columns(hist_multi_to, hist_multi_from, **kwargs):
    ''' Map columns from the sector multi index dataframe to a new column
    in the existing universe multi index dataframe.
    **kwargs should be in the format newcolumn="existing_sector_column"
    '''
    df_to = hist_multi_to.copy()
    df_from = hist_multi_from.copy()
    for key, value in kwargs.items():
        df_to[key] = np.nan
        for index, val in df_to.iterrows():
            try:
                df_to.loc[index, key] = df_from.loc[(index[0], val.xl_sect), value]
            except KeyError:
                pass
    return df_to
Now after running the above in a cell, you can access the function by calling it like this:
new_stock_df = map_columns(stock_df, etf_df, sect_etf_close='close_price')
new_stock_df
I hope this is more helpful. And as you can see, the function works, but with really large datasets it's extremely slow and inefficient.
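For reference, one possible vectorized sketch for this particular example (not necessarily how you'd want to structure it in general) is to look up all the (date, sector) pairs in the ETF frame at once instead of row by row:
# hypothetical vectorized lookup, assuming the example frames defined above
lookup_index = pd.MultiIndex.from_arrays(
    [stock_df.index.get_level_values('date'), stock_df['xl_sect']],
    names=['date', 'stock'],
)
new_stock_df = stock_df.copy()
# reindex returns NaN for (date, sector) pairs missing from etf_df,
# mirroring the KeyError -> NaN behaviour of the loop
new_stock_df['sect_etf_close'] = etf_df['close_price'].reindex(lookup_index).to_numpy()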
Right now I have two dataframes (let's call them A and B) built from Excel imports. Both have different dimensions as well as some empty/NaN cells. Let's say A is data for individual model numbers and B is a set of order information. For every row (unique item) in A, I want to search B for the (possibly) multiple orders for that item number, average the corresponding prices, and append A with a column containing the average price for each item.
The item numbers are alphanumeric, so they have to be strings. Not every item will have order/pricing information, and I'll be removing those at the next step. This is a large amount of data, so efficiency matters and iterrows probably isn't the right choice. Thank you in advance!
Here's what I have so far:
avgPrice = []
for index, row in dfA.iterrows():
    def avg_unit_price(item_no, unit_price):
        matchingOrders = []
        for item, price in zip(item_no, unit_price):
            if item == row['itemNumber']:
                matchingOrders.append(price)
        avgPrice.append(np.mean(matchingOrders))
    avg_unit_price(dfB['item_no'], dfB['unit_price'])
dfA['avgPrice'] = avgPrice
In general, avoid loops as they perform poorly. If you can't vectorise easily, then as a last resort you can try pd.Series.apply. In this case, neither was necessary.
import pandas as pd
# B: pricing data
df_b = pd.DataFrame([['I1', 34.1], ['I2', 541.31], ['I3', 451.3], ['I2', 644.3], ['I3', 453.2]],
columns=['item_no', 'unit_price'])
# create avg price dictionary
item_avg_price = df_b.groupby('item_no', as_index=False).mean().set_index('item_no')['unit_price'].to_dict()
# A: product data
df_a = pd.DataFrame([['I1'], ['I2'], ['I3'], ['I4']], columns=['item_no'])
# map price info to product data
df_a['avgPrice'] = df_a['item_no'].map(item_avg_price)
# remove unmapped items
df_a = df_a[pd.notnull(df_a['avgPrice'])]
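As a side note, the intermediate dictionary isn't strictly needed: Series.map also accepts a Series, so the same result can be sketched more directly as:
# equivalent sketch: map straight from the grouped means, then drop unmatched items
avg_price = df_b.groupby('item_no')['unit_price'].mean()
df_a['avgPrice'] = df_a['item_no'].map(avg_price)
df_a = df_a.dropna(subset=['avgPrice'])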