How do I make this more efficient? I feel like I should be able to do this without looping through the entire dataframe. Basically, I have to split the column CollectType into multiple columns depending on the value in column SSampleCode.
for i in range(0, len(df)):
    if df.SSampleCode[i] == 'Rock':
        df.R_SampleType[i] = df.CollectType[i]
    elif df.SSampleCode[i] == 'Soil':
        df.S_SampleType[i] = df.CollectType[i]
    elif df.SSampleCode[i] == 'Pan Con':
        df.PC_SampleType[i] = df.CollectType[i]
    elif df.SSampleCode[i] == 'Silt':
        df.SS_SampleType[i] = df.CollectType[i]
This can be done using boolean masks (a vectorized approach):
for i in range(0, len(df)):
    if df.SSampleCode[i] == 'Rock':
        df.R_SampleType[i] = df.CollectType[i]
will be
mask = df.SSampleCode=='Rock'
df.R_SampleType[mask] = df.CollectType[mask]
This will give you a good perf improvement.
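For completeness, here is a minimal sketch that applies the same mask idea to all four sample codes. The dictionary is just an assumed mapping built from the column names in the question, and .loc is used instead of chained indexing to avoid SettingWithCopy issues:
# Assumed mapping from SSampleCode values to the target columns named in the question
code_to_column = {
    'Rock': 'R_SampleType',
    'Soil': 'S_SampleType',
    'Pan Con': 'PC_SampleType',
    'Silt': 'SS_SampleType',
}

for code, column in code_to_column.items():
    mask = df.SSampleCode == code                        # one vectorized comparison per code
    df.loc[mask, column] = df.loc[mask, 'CollectType']   # assign without a Python row loop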
I am using zero-shot classification to label large amounts of data. I have written a simple function to assist me with this and am wondering if there is a better way for it to run. My current logic is to take the highest-scoring label and append it to a dataframe.
def labeler(input_df, output_df):
    labels = ['Fruit', 'Vegetable', 'Meat', 'Other']
    for i in tqdm(range(len(input_df))):
        temp = classifier(input_df['description'][i], labels)
        output = {'work_order_num': input_df['order_num'][i],
                  'work_order_desc': input_df['description'][i],
                  'label': temp['labels'][0],
                  'score': temp['scores'][0]}
        output_df.append(output)
In terms of speed and resources, would it be better to rewrite this function with a lambda?
Your problem boils down to iteration over the pandas dataframe input_df. Doing that with a for loop is not the most efficient way (see: How to iterate over rows in a DataFrame in Pandas).
I suggest doing something like this:
output_df[['work_order_num', 'work_order_desc']] = input_df[['order_num', 'description']].values  # these columns can be copied as a whole
def classification(df_desc):
    temp = classifier(df_desc, labels)
    return temp['labels'][0], temp['scores'][0]
output_df['label'], output_df['score'] = zip(*input_df['description'].apply(classification))
The classification function returns a tuple of values that needs to be unpacked, so I used the zip trick from this question.
Also, building a dataframe by repeated concatenation is a very slow process. So with the solution above you avoid two potentially prohibitively slow operations: the row-by-row for loop and appending rows to a dataframe one at a time.
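Pulling those pieces together, here is a minimal sketch of the whole flow. It assumes classifier is a transformers zero-shot-classification pipeline and that input_df has the columns from the question:
# Assumes `classifier` is a zero-shot-classification pipeline, e.g.:
#   from transformers import pipeline
#   classifier = pipeline('zero-shot-classification')
import pandas as pd

labels = ['Fruit', 'Vegetable', 'Meat', 'Other']

def classification(df_desc):
    # Classify a single description and keep only the top label and its score
    temp = classifier(df_desc, labels)
    return temp['labels'][0], temp['scores'][0]

output_df = pd.DataFrame({'work_order_num': input_df['order_num'],
                          'work_order_desc': input_df['description']})
output_df['label'], output_df['score'] = zip(*input_df['description'].apply(classification))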
I've got a DataFrame _my_df with shape (5187470, 109) and I need to iterate over it row by row to perform some tasks. One of the things that must be done in each iteration is to find a string value in one of its columns, get the corresponding rows, and use them in the tasks. This is a bottleneck and makes the execution too slow.
I've tried the following approaches without success:
# Some previous manipulation for the approaches
_grp = _my_df.groupby(_my_df['target'])
_records = _my_df.to_records(index=False)
# First approach as pandas usual - Very slow
df_final = _my_df[_my_df['target'] == 'value']
# Second approach by grouping - Very slow
df_final = _grp.get_group('value')
# Third approach using numpy - A little faster
df_final = _my_df.to_numpy()[_my_df['target'].to_numpy() == 'value']
# Fourth approach using recarray - A little faster
df_final = pd.DataFrame.from_records(_records[np.where(_records['target'] == 'value')])
But none of them is fast enough... Do you know any other approach I could use here?
Thank you very much in advance
I have a very simple loop that just takes too long to iterate over my big dataframe.
value_needed = df.at[n, 'column_A']
for y in range(0, len(df)):
    index = df[df['column_B'].ge(value_needed)].index[y]
    if index > n:
        break
With this, I'm trying to find the first index that has a value greater than value_needed. The problem is that this loop is just too inefficient to run when len(df) > 200000.
Any ideas on how to solve this issue?
In general you should try to avoid loops with pandas; here is a vectorized way to get what you want:
df.loc[(df['column_B'].ge(value_needed)) & (df.index > n)].index[0]
I wish you had provided sample data. Try this on your data and let me know what you get:
import numpy as np
index = np.where(df['column_B'] > value_needed)[0].flat[0]
Then
#continue with other logic
Not sure if this is a good idea after all, but having a dictionary with arrays as values, such as
from numpy import array

DF = {'z_eu': array([127.45064758, 150.4478288, 150.74781189, -98.3227338, -98.25155681, -98.24993753]),
      'Process': array(['initStep', 'Transportation', 'Transportation', 'Transportation', 'Transportation', 'phot']),
      'Creator': array(['SynRad', 'SynRad', 'SynRad', 'SynRad', 'SynRad', 'SynRad'])}
I need to do a selection of the numeric data (z_eu) based on values of the other two keys.
One workaround I came up with so far was to extract the arrays and iterate through them, thereby creating another list which contains the valid data.
proc = DF['Process']; z= DF['z_eu']; creat = DF['Creator']
data = [z for z,p,c in zip(z, proc,creat) if (p == 'initStep') and c=='SynRad' ]
But somehow this seems like effort that could be avoided entirely by dealing with the dictionary more intelligently in the first place? Also, the zip() takes a long time.
I know that dataframes are a valid alternative but unfortunately, since I'm dealing with strings, pandas appears to be too slow.
Any hints are most welcome!
A bit simpler, using conditional slicing you could write
data = DF['z_eu'][(DF['Process'] == 'initStep') & (DF['Creator'] == 'SynRad')]
...or still using zip, you could simplify to
data = [z for z, p, c in zip(*DF.values()) if p == 'initStep' and c == 'SynRad']
Basically also conditional slicing, using a pandas DataFrame:
import pandas as pd

df = pd.DataFrame(DF)
data = df.loc[(df['Process'] == 'initStep') & (df['Creator'] == 'SynRad'), 'z_eu']
print(data)
# 0 127.450648
# Name: z_eu, dtype: float64
In principle I'd say there's nothing wrong with handling numpy arrays in a dict. You'll have a lot of flexibility and sometimes operations are more efficient if you do them straight in numpy (you could even utilize numba for purely numerical, expensive calculations) - but if that is not needed and you're fine with basically a n*m table, pandas dfs are nice and convenient.
If your dataset is large and you want to perform many look-ups as the one shown, you might not want to perform those on strings. To improve performance, you could e.g. come up with unique IDs (integers) for each 'Process' or 'Creator' from the example. You'll just need to be able to map those back to the original strings, so keep that data as well.
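A minimal sketch of that idea, assuming the dict layout from the question; np.unique with return_inverse builds the integer codes and keeps a lookup table to map them back (the variable names are just illustrative):
import numpy as np

# Build integer codes for the string columns once, keeping the lookup tables
process_values, process_codes = np.unique(DF['Process'], return_inverse=True)
creator_values, creator_codes = np.unique(DF['Creator'], return_inverse=True)

# Look up the integer IDs of the strings we care about
init_step_id = np.searchsorted(process_values, 'initStep')
synrad_id = np.searchsorted(creator_values, 'SynRad')

# Subsequent selections compare integers instead of strings
data = DF['z_eu'][(process_codes == init_step_id) & (creator_codes == synrad_id)]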
You can loop over one array and use the index to get the matching elements of the others:
z_eu = DF['z_eu']
process = DF['Process']
creator = DF['Creator']

result = []
for i in range(len(z_eu)):
    if process[i] == 'initStep' and creator[i] == 'SynRad':
        result.append(z_eu[i])

print(result)
I wrote some code that concatenates parts of a DataFrame back onto the same DataFrame, so as to normalize the occurrence of rows according to a certain column.
import random
import pandas

def normalize(data, expectation):
    """Normalize data by duplicating existing rows"""
    counts = data[expectation].value_counts()
    max_count = int(counts.max())
    for tag, group in data.groupby(expectation, sort=False):
        array = pandas.DataFrame(columns=data.columns.values)
        i = 0
        while i < (max_count // int(counts[tag])):
            array = pandas.concat([array, group])
            i += 1
        i = int(max_count % counts[tag])
        if i > 0:
            array = pandas.concat([array, group.loc[random.sample(list(group.index), i)]])
        data = pandas.concat([data, array])
    return data
and this is unbelievably slow. Is there a way to concatenate DataFrames quickly without creating copies of them each time?
There are a couple of things that stand out.
To begin with, the loop
i = 0
while i < (max_count // int(counts[tag])):
    array = pandas.concat([array, group])
    i += 1
is going to be very slow. Pandas is not built for these dynamic concatenations, and I suspect the performance is quadratic for what you're doing.
Instead, perhaps you could try
pandas.concat([group] * (max_count // int(counts[tag])))
which builds the list first and then calls concat once on the entire list. This should bring the complexity down to linear, and I suspect it will have lower constants in any case.
Another thing that would reduce these small concats is calling groupby-apply. Instead of iterating over the result of groupby, write the loop body as a function and call apply on it. Let Pandas figure out how best to concat all of the results into a single DataFrame.
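A minimal sketch of that groupby-apply idea, reusing counts and max_count from the question and omitting the random-remainder step for brevity (duplicate_group is a hypothetical helper name):
def duplicate_group(group):
    # group.name holds the tag for this group; repeat the group enough times
    # to roughly match the most frequent tag.
    repeats = max_count // int(counts[group.name])
    return pandas.concat([group] * repeats)

duplicated = data.groupby(expectation, sort=False, group_keys=False).apply(duplicate_group)
data = pandas.concat([data, duplicated])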
However, even if you prefer to keep the loop, I'd just append things to a list and concat everything once at the end:
stuff = []
for tag, group in data.groupby(expectation, sort=False):
    ...  # Call stuff.append for any DataFrame you were going to concat.
pandas.concat(stuff)
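Putting that together, here is a sketch of how normalize might look with a single concat at the end, keeping the question's floor-division and random-remainder logic. This is an illustration under those assumptions, not a tested drop-in replacement:
import random
import pandas

def normalize(data, expectation):
    """Normalize data by duplicating existing rows, concatenating only once at the end."""
    counts = data[expectation].value_counts()
    max_count = int(counts.max())
    stuff = [data]
    for tag, group in data.groupby(expectation, sort=False):
        # Whole copies of the group
        stuff.extend([group] * (max_count // int(counts[tag])))
        # Random remainder so the counts match exactly
        remainder = int(max_count % counts[tag])
        if remainder > 0:
            stuff.append(group.loc[random.sample(list(group.index), remainder)])
    return pandas.concat(stuff)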