I have the following workflow in a Python notebook
Load data into a pandas dataframe from a table (around 200K rows) --> I will call this orig_DF moving forward
Manipulate orig_DF to get into a DF that has columns <Feature1, Feature 2,...,Feature N, Label> --> I will call this derived DF ```ML_input DF`` moving forward. This DF is used to train a ML model
To get ML_input DF, I need to do some complex processing on each row in orig_DF. In particular, each row in orig_DF gets converted into multiple "rows" (number unknown before processing a row) in ML_input DF
Currently, I am doing (code below)
orig_df.iterrows() to loop through each row
Apply a function on each row. This returns a list.
Accumulate results from multiple rows into one list
Convert this list into ML_input DF after the loop ends
This works but I want speed this up by parallelizing the work on each row and accumulating the results. Would appreciate pointers from Pandas experts on how to do this. An example would be greatly appreciated
Current code is below.
Note: I have looked into using df.apply(). But two issues seem to be
apply in itself does not seem to parallelize things.
I don't how to make apply handle this one row converted to multiple row issue (any pointers here will also help)
Current code
def get_training_dataframe(dfin):
X = []
for index, row in dfin.iterrows():
ts_frame_dict = ast.literal_eval(row["sample_dictionary"])
for ts, frame in ts_frame_dict.items():
features = get_features(frame)
if features != None:
X += [features]
return pd.DataFrame(X, columns=FEATURE_NAMES)
It's difficult to know what optimizations are possible without having example data and without knowing what get_features() does.
The following code ought to be equivalent (I think) to your code, but it attempts to "vectorize" each step instead of performing it all within the for-loop. Perhaps that will offer you a chance to more easily measure the time taken by each step, and optimize the bottlenecks.
In particular, I wonder if it's faster to combine the calls to ast.literal_eval() into a single call. That's what I've done here, but I have no idea if it's truly faster.
I recommend trying line profiler if you can.
import ast
import pandas as pd
def get_training_dataframe(dfin):
frame_dicts = ast.literal_eval('[' + ','.join(dfin['sample_dictionary']) + ']')
frames = chain(*(d.values() for d in frame_dicts))
features = map(get_features, frames)
features = [f for f in features if f is not None]
return pd.DataFrame(features, columns=FEATURE_NAMES)
I wrote a function in which each cell of a DataFrame is divided by a number saved in another dataframe.
def calculate_dfA(df_t,xout):
df_A = df_t.copy()
vector_x = xout.T
for index_col, column in tqdm(df_A.iteritems()):
for index_row, row in df_A.iterrows():
df_A.iloc[index_row,index_col] = df_A.iloc[index_row,index_col]/vector_x.iloc[0,index_col]
return(df_A)
The DataFrame on which I apply the calculation has a size of 14839 rows x 14839 columns. According to tqdm the processing speed is roughly 4.5s/it. Accordingly, the calculation will require approixmately 50 days which is not feasible for me. Is there a way to speed up my calculation?
You need to vectorize your division:
result = df_A.values/vector_x
This will broadcast along the row dimension and divide along the column dimension, as you seem to ask for.
Compared to your double for-loop, you are taking advantage of contiguity and homogeneity of the data in memory. This allows for a massive speedup.
Edit: Coming back to this answer today, I am spotting that converting to a numpy array first speeds up the computation. Locally I get a 10x speedup for an array of size similar to the one in the question here-above. Have edited my answer.
I'm on mobile now but you should try to avoid every for loop in python - theres always a better way
For one I know you can multiply a pandas column (Series) times a column to get your desired result.
I think to multiply every column with the matching column of another DataFrame you would still need to iterate (but only with one for loop => performance boost)
I would strongly recommend that you temporarily convert to a numpy ndarray and work with these
I have a dataframe that has 2 columns of zipcodes, I would like to add another column with their distance values, I am able to do this with a fairly low number of rows, but I am now working with a dataframe that has about 500,000 rows for calculations. The code I have works, but on my current dataframe it's been about 30 minutes of running, and still no completion, so I feel what i'm doing is extremely inefficient.
Here is the code
import pgeocode
dist = pgeocode.GeoDistance('us')
def distance_pairing(start,end):
return dist.query_postal_code(start, end)
zips['distance'] = zips.apply(lambda x: distance_pairing(x['zipstart'], x['zipend']), axis=1)
zips
I know looping is out of the question, so is there something else I can do, efficiency wise that would make this better?
Whenever possible, use vectorized operations in pandas and numpy. In this case:
zips['distance'] = dist.query_postal_code(
zips['zipstart'].values,
zips['zipend'].values,
)
This won't always work, but in this case, the underlying pgeocode.haversine function is written (in numpy) to accommodate arrays of x and y coordinates. This should speed up your code by several orders of magnitude for a dataframe of this size.
I wrote some code to perform interpolation based on two criteria, the amount of insurance and the deductible amount %. I was struggling to do the interpolation all at once, so had split the filtering.The table hf contains the known data which I am using to base my interpolation results on.Table df contains the new data which needs the developed factors interpolated based on hf.
Right now my work around is first filtering each table based on the ded_amount percentage and then performing the interpolation into an empty data frame and appending after each loop.
I feel like this is inefficient, and there is a better way to perform this, looking to hear some feedback on some improvements I can make. Thanks
Test data provided below.
import pandas as pd
from scipy import interpolate
known_data={'AOI':[80000,100000,150000,200000,300000,80000,100000,150000,200000,300000],'Ded_amount':['2%','2%','2%','2%','2%','3%','3%','3%','3%','3%'],'factor':[0.797,0.774,0.739,0.733,0.719,0.745,0.737,0.715,0.711,0.709]}
new_data={'AOI':[85000,120000,130000,250000,310000,85000,120000,130000,250000,310000],'Ded_amount':['2%','2%','2%','2%','2%','3%','3%','3%','3%','3%']}
hf=pd.DataFrame(known_data)
df=pd.DataFrame(new_data)
deduct_fact=pd.DataFrame()
for deduct in hf['Ded_amount'].unique():
deduct_table=hf[hf['Ded_amount']==deduct]
aoi_table=df[df['Ded_amount']==deduct]
x=deduct_table['AOI']
y=deduct_table['factor']
f=interpolate.interp1d(x,y,fill_value="extrapolate")
xnew=aoi_table[['AOI']]
ynew=f(xnew)
append_frame=aoi_table
append_frame['Factor']=ynew
deduct_fact=deduct_fact.append(append_frame)
Yep, there is a way to do this more efficiently, without having to make a bunch of intermediate dataframes and appending them. have a look at this code:
from scipy import interpolate
known_data={'AOI':[80000,100000,150000,200000,300000,80000,100000,150000,200000,300000],'Ded_amount':['2%','2%','2%','2%','2%','3%','3%','3%','3%','3%'],'factor':[0.797,0.774,0.739,0.733,0.719,0.745,0.737,0.715,0.711,0.709]}
new_data={'AOI':[85000,120000,130000,250000,310000,85000,120000,130000,250000,310000],'Ded_amount':['2%','2%','2%','2%','2%','3%','3%','3%','3%','3%']}
hf=pd.DataFrame(known_data)
df=pd.DataFrame(new_data)
# Create this column now
df['Factor'] = None
# I like specifying this explicitly; easier to debug
deduction_amounts = list(hf.Ded_amount.unique())
for deduction_amount in deduction_amounts:
# You can index a dataframe and call a column in one line
x, y = hf[hf['Ded_amount']==deduction_amount]['AOI'], hf[hf['Ded_amount']==deduction_amount]['factor']
f = interpolate.interp1d(x, y, fill_value="extrapolate")
# This is the most important bit. Lambda function on the dataframe
df['Factor'] = df.apply(lambda x: f(x['AOI']) if x['Ded_amount']==deduction_amount else x['Factor'], axis=1)
The way the lambda function works is:
It goes row by row through the column 'Factor' and gives it a value based on conditions on the other columns.
It returns the interpolation of the AOI column of df (this is what you called xnew) if the deduction amount matches, otherwise it just returns the same thing back.
I have a dataframe which consists of 6K items, each item having 70 fields to describe it. So the length of my dataframe is around 420K rows. Then I apply my function like this:
df_dirty[['basic_score', 'additional_score']] = df_dirty.apply(compare.compare, axis=1)
Compare function takes the row from df_dirty and then takes and ID from that row, depending on which it takes other two cells from that row and performs a comparison of those two cells. The comparison may be a simple
if cell1 == cell2:
return True
else:
return False
or a more difficult calculation that takes the values of those cells and then calculates if their ratio is in some range or whatever.
Overall - the function I apply to my dataframe is performing some more actions so it's very time consuming for large datasets of complex data(not only clean numbers, but number and text combinations, etc).
I was wondering if there are any faster ways to do this than simply applying a function?
I have some ideas about what should I do with this:
Put everything on a server and perform all calculations overnight, so it would be faster to just ask for an already calculated result,
also I thought that this would maybe be faster if I used C to write my compare function. What are my other options?