I have the following workflow in a Python notebook
Load data into a pandas dataframe from a table (around 200K rows) --> I will call this orig_DF moving forward
Manipulate orig_DF to get a DF that has columns <Feature1, Feature2, ..., FeatureN, Label> --> I will call this derived DF `ML_input DF` moving forward. This DF is used to train an ML model
To get ML_input DF, I need to do some complex processing on each row in orig_DF. In particular, each row in orig_DF gets converted into multiple "rows" (number unknown before processing a row) in ML_input DF
Currently, I am doing (code below)
orig_df.iterrows() to loop through each row
Apply a function on each row. This returns a list.
Accumulate results from multiple rows into one list
Convert this list into ML_input DF after the loop ends
This works, but I want to speed this up by parallelizing the work on each row and accumulating the results. I would appreciate pointers from Pandas experts on how to do this. An example would be greatly appreciated
Current code is below.
Note: I have looked into using df.apply(). But two issues seem to be
apply in itself does not seem to parallelize things.
I don't know how to make apply handle this one-row-converted-to-multiple-rows issue (any pointers here will also help)
Current code
def get_training_dataframe(dfin):
    X = []
    for index, row in dfin.iterrows():
        ts_frame_dict = ast.literal_eval(row["sample_dictionary"])
        for ts, frame in ts_frame_dict.items():
            features = get_features(frame)
            if features is not None:
                X += [features]
    return pd.DataFrame(X, columns=FEATURE_NAMES)
It's difficult to know what optimizations are possible without having example data and without knowing what get_features() does.
The following code ought to be equivalent (I think) to your code, but it attempts to "vectorize" each step instead of performing it all within the for-loop. Perhaps that will offer you a chance to more easily measure the time taken by each step, and optimize the bottlenecks.
In particular, I wonder if it's faster to combine the calls to ast.literal_eval() into a single call. That's what I've done here, but I have no idea if it's truly faster.
I recommend trying line profiler if you can.
import ast
from itertools import chain

import pandas as pd

def get_training_dataframe(dfin):
    frame_dicts = ast.literal_eval('[' + ','.join(dfin['sample_dictionary']) + ']')
    frames = chain(*(d.values() for d in frame_dicts))
    features = map(get_features, frames)
    features = [f for f in features if f is not None]
    return pd.DataFrame(features, columns=FEATURE_NAMES)
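Since the question also asks about parallelizing the per-row work (and about one row expanding into multiple rows), here is a minimal sketch using the standard-library multiprocessing module. It assumes get_features is CPU-bound and picklable, reuses get_features and FEATURE_NAMES from the question, and introduces process_row purely as an illustrative helper name:

import ast
from itertools import chain
from multiprocessing import Pool

import pandas as pd

def process_row(sample_dictionary):
    # Expand one orig_DF row into zero or more feature rows.
    ts_frame_dict = ast.literal_eval(sample_dictionary)
    feats = (get_features(frame) for frame in ts_frame_dict.values())
    return [f for f in feats if f is not None]

def get_training_dataframe_parallel(dfin, processes=4):
    with Pool(processes) as pool:
        # One list of feature rows per input row; each list can have any length.
        per_row = pool.map(process_row, dfin['sample_dictionary'].tolist())
    # Flatten the per-row lists into a single table.
    return pd.DataFrame(list(chain.from_iterable(per_row)), columns=FEATURE_NAMES)

Whether this beats the single-process version depends on how heavy get_features is relative to the pickling overhead, so it is worth profiling both.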
I am using zero shot classification to label large amounts of data. I have written a simple function to assist me with this and am wondering if there is a better way for this to run. My current logic was to take the highest score and label and append this label into a dataframe.
def labeler(input_df,output_df):
    labels = ['Fruit','Vegetable','Meat','Other']
    for i in tqdm(range(len(input_df))):
        temp = classifier(input_df['description'][i],labels)
        output = {'work_order_num':input_df['order_num'][i],
                  'work_order_desc':input_df['description'][i],
                  'label':temp['labels'][0],
                  'score':temp['scores'][0]}
        output_df.append(output)
In terms of speed and resources, would it be better to shape this function with a lambda?
Your problem boils down to iteration over the pandas dataframe input_df. Doing that with a for loop is not the most efficient way (see: How to iterate over rows in a DataFrame in Pandas).
I suggest doing something like this:
labels = ['Fruit', 'Vegetable', 'Meat', 'Other']

# These columns can be copied over as a whole.
output_df = input_df[['order_num', 'description']].rename(
    columns={'order_num': 'work_order_num', 'description': 'work_order_desc'})

def classification(desc):
    temp = classifier(desc, labels)
    return temp['labels'][0], temp['scores'][0]

output_df['label'], output_df['score'] = zip(*input_df['description'].apply(classification))
The classification function returns tuples of values that need to be unpacked, so I used the zip trick from this question.
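To see the unpacking pattern on its own (a toy example, independent of the classifier):

import pandas as pd

s = pd.Series([('Fruit', 0.9), ('Meat', 0.7)])  # a Series of (label, score) tuples
labels, scores = zip(*s)
# labels -> ('Fruit', 'Meat'), scores -> (0.9, 0.7)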
Also, building a dataframe by repeated concatenation/appending is a very slow process. So with the solution above you avoid two potentially prohibitively slow operations: the explicit for-loop and appending rows to a dataframe one at a time.
Problem statement
I had the following problem:
I have samples that ran independent tests. In my dataframe, tests of a sample with the same "test name" are also independent, so the pair (test, sample) is independent and unique.
Data are collected at non-regular sampling rates, so we are dealing with unequally spaced indices. This "time series" index is called nonreg_idx in the example. For the sake of simplicity, it is a float between 0 and 1.
I want to figure out the value at a specific index, e.g. nonreg_idx=0.5. If the value is missing, I just want a linear interpolation that depends on the index. If extrapolation would be needed because 0.5 lies outside the sorted nonreg_idx values of the group (test, sample), leaving NaN is fine.
Note the following from pandas documentation:
Please note that only method='linear' is supported for
DataFrame/Series with a MultiIndex.
’linear’: Ignore the index and treat the values as equally spaced.
This is the only method supported on MultiIndexes.
The only solution I found is long, complex and slow. I am wondering if I am missing out on something, or on the contrary something is missing from the pandas library. I believe this is a typical issue in scientific and engineering fields to have independent tests on various samples with non regular indices.
What I tried
sample data set preparation
This part is just for making an example
import pandas as pd
import numpy as np
tests = (f'T{i}' for i in range(20))
samples = (chr(i) for i in range(97,120))
idx = pd.MultiIndex.from_product((tests,samples),names=('tests','samples'))
idx
dfs=list()
for ids in idx:
    group_idx = pd.MultiIndex.from_product(((ids[0],),(ids[1],),tuple(np.random.random_sample(size=(90,))))).sort_values()
    dfs.append(pd.DataFrame(1000*np.random.random_sample(size=(90,)),index=group_idx))
df = pd.concat(dfs)
df = df.rename_axis(index=('test','sample','nonreg_idx')).rename({0:'value'},axis=1)
The (bad) solution
add_missing = df.index.droplevel('nonreg_idx').unique().to_frame().reset_index(drop=True)
add_missing['nonreg_idx'] = .5
add_missing = pd.MultiIndex.from_frame(add_missing)
added_missing = df.reindex(add_missing)
df_full = pd.concat([added_missing.loc[~added_missing.index.isin(df.index)], df])
df_full.sort_index(inplace=True)
def interp_fnc(group):
    try:
        return (group
                .reset_index(['test', 'sample'])
                .interpolate(method='slinear')
                .set_index(['test', 'sample'], append=True)
                .reorder_levels(['test', 'sample', 'nonreg_idx'])
                .sort_index())
    except Exception:
        return group

grouped = df_full.groupby(level=['test', 'sample'])
df_filled = grouped.apply(interp_fnc)
Here, the wanted values are in df_filled. So I can do df_filled.loc[(slice(None), slice(None), .5),'value'] to get what I need for each sample/test.
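For reference, the same selection can also be written a little more readably with pd.IndexSlice (equivalent to the tuple of slices above; ixs is just a local alias):

ixs = pd.IndexSlice
df_filled.loc[ixs[:, :, 0.5], 'value']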
I would have expected to be able to do the same in 1 or at most 2 lines of code; I have 14 here. apply is quite a slow method, and I can't even use numba.
Question
Can someone propose a better solution?
If you think there is no better alternative, please comment and I'll open an issue...
I have a dataframe that has 2 columns of zipcodes, and I would like to add another column with their distance values. I am able to do this with a fairly low number of rows, but I am now working with a dataframe that has about 500,000 rows. The code I have works, but it has been running on my current dataframe for about 30 minutes with no sign of completion, so I feel what I'm doing is extremely inefficient.
Here is the code
import pgeocode
dist = pgeocode.GeoDistance('us')
def distance_pairing(start, end):
    return dist.query_postal_code(start, end)

zips['distance'] = zips.apply(lambda x: distance_pairing(x['zipstart'], x['zipend']), axis=1)
zips
I know looping is out of the question, so is there something else I can do, efficiency-wise, that would make this better?
Whenever possible, use vectorized operations in pandas and numpy. In this case:
zips['distance'] = dist.query_postal_code(
    zips['zipstart'].values,
    zips['zipend'].values,
)
This won't always work, but in this case, the underlying pgeocode.haversine function is written (in numpy) to accommodate arrays of x and y coordinates. This should speed up your code by several orders of magnitude for a dataframe of this size.
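A complementary trick, sketched here on the assumption (not stated in the question) that many zipstart/zipend pairs repeat across the 500,000 rows, is to compute each unique pair only once and merge the results back:

# Compute distances only for unique (zipstart, zipend) pairs, then merge back.
pairs = zips[['zipstart', 'zipend']].drop_duplicates().copy()
pairs['distance'] = dist.query_postal_code(pairs['zipstart'].values, pairs['zipend'].values)
zips = zips.merge(pairs, on=['zipstart', 'zipend'], how='left')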
I am trying to create a new dataframe based on a condition applied per group.
Suppose, I have dataframe with Name, Flag and Month.
import pandas as pd
import numpy as np
data = {'Name':['A', 'A', 'B', 'B'], 'Flag':[0, 1, 0, 1], 'Month':[1,2,1,2]}
df = pd.DataFrame(data)
need = df.loc[df['Flag'] == 0].groupby(['Name'], as_index = False)['Month'].min()
My condition is to find the minimum month where the flag equals 0, per name.
I have used .loc to define my condition; it works fine, but I found it has quite poor performance when applied to 10 million rows.
Any more efficient way to do so?
Thank you!
Just had this same scenario yesterday, where I took a 90-second process down to about 3 seconds. Because speed is your concern (like mine was) and you aren't limited to pure Pandas, I would recommend using Numba and NumPy. The catch is you're going to have to brush up on your data structures and types to get a good grasp on what Numba is really doing with JIT. Once you do though, it rocks.
I would recommend finding a way to get every value in your DataFrame to an integer. For your name column, try unique ID's. Flag and month already look good.
name_ids = []
for i, name in enumerate(np.unique(df["Name"])):
    name_ids.append({i: name})
Then, create a function and loop the old-fashioned way:
from numba import njit

@njit
def really_fast_numba_loop(data):
    for row in data:
        # do stuff
        pass
    return data

new_df = really_fast_numba_loop(data)
The first time your function is called in your file, it will be about the same speed as it would elsewhere, but all the other times it will be lightning fast. So, the trick is finding the balance between what to put in the function and what to put in its outside loop.
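As a concrete illustration applied to the Flag/Month example above (a rough sketch, not benchmarked; group_min_where_zero is a helper name introduced here), you could factorize the names into integer codes and compute the per-group minimum in a Numba loop:

import numpy as np
import pandas as pd
from numba import njit

@njit
def group_min_where_zero(name_codes, flags, months, n_groups):
    # -1 means "no row with Flag == 0 seen yet for this group".
    out = np.full(n_groups, -1, dtype=np.int64)
    for i in range(name_codes.shape[0]):
        if flags[i] == 0:
            g = name_codes[i]
            if out[g] == -1 or months[i] < out[g]:
                out[g] = months[i]
    return out

codes, uniques = pd.factorize(df['Name'])
mins = group_min_where_zero(codes, df['Flag'].to_numpy(), df['Month'].to_numpy(), len(uniques))
need = pd.DataFrame({'Name': uniques, 'Month': mins})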
In either case, when you're done processing your values, convert your name_ids back to strings and wrap your data in pd.DataFrame.
Et voila. You just beat Pandas iterrows/itertuples.
Comment back if you have questions!
I wrote some code to perform interpolation based on two criteria: the amount of insurance and the deductible amount (%). I was struggling to do the interpolation all at once, so I split the filtering. The table hf contains the known data on which I am basing my interpolation results. The table df contains the new data, which needs the developed factors interpolated based on hf.
Right now my workaround is to first filter each table based on the Ded_amount percentage, then perform the interpolation into an empty dataframe, appending after each loop iteration.
I feel like this is inefficient and there is a better way to do it; I am looking for feedback on improvements I can make. Thanks
Test data provided below.
import pandas as pd
from scipy import interpolate
known_data={'AOI':[80000,100000,150000,200000,300000,80000,100000,150000,200000,300000],'Ded_amount':['2%','2%','2%','2%','2%','3%','3%','3%','3%','3%'],'factor':[0.797,0.774,0.739,0.733,0.719,0.745,0.737,0.715,0.711,0.709]}
new_data={'AOI':[85000,120000,130000,250000,310000,85000,120000,130000,250000,310000],'Ded_amount':['2%','2%','2%','2%','2%','3%','3%','3%','3%','3%']}
hf=pd.DataFrame(known_data)
df=pd.DataFrame(new_data)
deduct_fact=pd.DataFrame()
for deduct in hf['Ded_amount'].unique():
    deduct_table=hf[hf['Ded_amount']==deduct]
    aoi_table=df[df['Ded_amount']==deduct]
    x=deduct_table['AOI']
    y=deduct_table['factor']
    f=interpolate.interp1d(x,y,fill_value="extrapolate")
    xnew=aoi_table[['AOI']]
    ynew=f(xnew)
    append_frame=aoi_table
    append_frame['Factor']=ynew
    deduct_fact=deduct_fact.append(append_frame)
Yep, there is a way to do this more efficiently, without having to make a bunch of intermediate dataframes and append them. Have a look at this code:
import pandas as pd
from scipy import interpolate

known_data={'AOI':[80000,100000,150000,200000,300000,80000,100000,150000,200000,300000],'Ded_amount':['2%','2%','2%','2%','2%','3%','3%','3%','3%','3%'],'factor':[0.797,0.774,0.739,0.733,0.719,0.745,0.737,0.715,0.711,0.709]}
new_data={'AOI':[85000,120000,130000,250000,310000,85000,120000,130000,250000,310000],'Ded_amount':['2%','2%','2%','2%','2%','3%','3%','3%','3%','3%']}
hf=pd.DataFrame(known_data)
df=pd.DataFrame(new_data)

# Create this column now
df['Factor'] = None

# I like specifying this explicitly; easier to debug
deduction_amounts = list(hf.Ded_amount.unique())

for deduction_amount in deduction_amounts:
    # You can index a dataframe and call a column in one line
    x, y = hf[hf['Ded_amount']==deduction_amount]['AOI'], hf[hf['Ded_amount']==deduction_amount]['factor']
    f = interpolate.interp1d(x, y, fill_value="extrapolate")
    # This is the most important bit: a lambda function applied over the dataframe
    df['Factor'] = df.apply(lambda x: f(x['AOI']) if x['Ded_amount']==deduction_amount else x['Factor'], axis=1)
The way the lambda function works is:
It goes row by row and fills the 'Factor' column with a value based on conditions on the other columns.
For each row, it returns the interpolation of that row's AOI value (this is what you called xnew) if the deduction amount matches; otherwise it just returns the row's existing 'Factor' value unchanged.
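If the row-wise lambda ever becomes the bottleneck, the same loop can be sketched with a boolean mask instead, assigning each group's interpolated values in one shot (a variant of the code above, not a different method):

for deduction_amount in deduction_amounts:
    known = hf[hf['Ded_amount'] == deduction_amount]
    f = interpolate.interp1d(known['AOI'], known['factor'], fill_value="extrapolate")
    mask = df['Ded_amount'] == deduction_amount
    # interp1d accepts an array, so the whole group can be filled at once.
    df.loc[mask, 'Factor'] = f(df.loc[mask, 'AOI'])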