I have a dataframe which consists of 6K items, each item having 70 fields to describe it. So the length of my dataframe is around 420K rows. Then I apply my function like this:
df_dirty[['basic_score', 'additional_score']] = df_dirty.apply(compare.compare, axis=1)
The compare function takes a row from df_dirty, reads an ID from that row, and, depending on that ID, takes two other cells from the row and compares them. The comparison may be as simple as
if cell1 == cell2:
    return True
else:
    return False
or a more involved calculation that takes the values of those cells and checks, for example, whether their ratio falls within some range.
Overall, the function I apply to my dataframe performs several more actions, so it is very time consuming for large datasets of complex data (not only clean numbers, but mixes of numbers and text, etc.).
I was wondering if there are any faster ways to do this than simply applying a function?
I have some ideas about what I could do:
Put everything on a server and perform all calculations overnight, so it would be faster to just ask for an already calculated result;
I also thought it might be faster if I wrote my compare function in C. What are my other options?
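For what it's worth, a row-wise apply like this can often be replaced by whole-column operations selected per ID. A minimal sketch, where cell1, cell2, id and the 'ratio_rule' label are hypothetical stand-ins for the real column names and the 0.9-1.1 range is a placeholder:

import pandas as pd

# Hypothetical columns: 'cell1'/'cell2' hold the values, 'id' picks the rule.
# Simple equality, computed for every row in one vectorized pass:
df_dirty['basic_score'] = df_dirty['cell1'] == df_dirty['cell2']

# A numeric "ratio in range" rule, applied only to the rows whose ID asks for it:
mask = df_dirty['id'].eq('ratio_rule')                      # hypothetical rule selector
ratio = (pd.to_numeric(df_dirty.loc[mask, 'cell1'], errors='coerce')
         / pd.to_numeric(df_dirty.loc[mask, 'cell2'], errors='coerce'))
df_dirty['additional_score'] = False
df_dirty.loc[mask, 'additional_score'] = ratio.between(0.9, 1.1)   # placeholder bounds

Whether this pays off depends on how much of the per-row logic can be expressed as column arithmetic, but each rule that can be vectorized removes one Python-level call per row.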
I am using zero-shot classification to label large amounts of data. I have written a simple function to assist with this and am wondering if there is a better way for it to run. My current logic is to take the highest score and label and append that label to a dataframe.
def labeler(input_df, output_df):
    labels = ['Fruit', 'Vegetable', 'Meat', 'Other']
    for i in tqdm(range(len(input_df))):
        temp = classifier(input_df['description'][i], labels)
        output = {'work_order_num': input_df['order_num'][i],
                  'work_order_desc': input_df['description'][i],
                  'label': temp['labels'][0],
                  'score': temp['scores'][0]}
        output_df.append(output)
In terms of speed and resources, would it be better to rewrite this function with a lambda?
Your problem boils down to iteration over the pandas dataframe input_df. Doing that with a for loop is not the most efficient way (see: How to iterate over rows in a DataFrame in Pandas).
I suggest doing something like this:
output_df[['work_order_num', 'work_order_desc']] = input_df[['order_num', 'description']].to_numpy()  # these columns can be copied as a whole
def classification(df_desc):
    temp = classifier(df_desc, labels)
    return temp['labels'][0], temp['scores'][0]
output_df['label'], output_df['score'] = zip(*input_df['description'].apply(classification))
The classification function returns a tuple of values that needs to be unpacked, so I used the zip trick from this question.
Also, building a dataframe by repeated appends is a very slow process. With the solution above you avoid two potentially prohibitively slow operations: the Python for loop and appending rows to a dataframe one at a time.
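As a further note, if classifier is a Hugging Face zero-shot-classification pipeline, it also accepts a list of texts, so the model can be called once instead of once per row. A sketch under that assumption, reusing the column names from the question:

# Assumes `classifier` is a transformers zero-shot-classification pipeline,
# which accepts a list of sequences and returns one result dict per sequence.
labels = ['Fruit', 'Vegetable', 'Meat', 'Other']
results = classifier(input_df['description'].tolist(), labels)

output_df['label'] = [r['labels'][0] for r in results]
output_df['score'] = [r['scores'][0] for r in results]

The per-text model inference will still dominate the runtime, but batching the call removes the Python-level loop entirely.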
I am new to Python, so I'm wondering if there is a way to make my code more efficient.
I am analyzing a dataframe that has 800 columns and 252 rows. I am analyzing each column's difference to the 799 remaining columns throughout the 252 rows, then I add up all the squared differences and store that value in a new dataframe. The result is a dataframe (symmetric matrix) that is 800x800 showing the summed squared differences throughout the 252 rows among each possible pair of columns in the dataset.
The issue is that the code took 5 hours to run, so I wanted to know if any of you have suggestions on how to make it more efficient, or maybe there is a built-in function already? The .cov() function performs somewhat similar calculations and takes only a few seconds to run on the same dataset, so there must be a way to improve my code, which you can find below:
sqdiff_df = pd.DataFrame(columns=df.columns, index=df.columns)

# filling up the empty dataframe with the squared differences
for c in sqdiff_df.columns:
    for i in sqdiff_df.index:
        ticker_c_normalisedP = year1_normalized_p[c]
        ticker_i_normalisedP = year1_normalized_p[i]
        sqdiff_ic = (year1_normalized_p.eval("@ticker_c_normalisedP - @ticker_i_normalisedP") ** 2).sum()
        sqdiff_df.loc[i, c] = sqdiff_ic
You should be able to use sqdiff_ic = ((ticker_c_normalisedP - ticker_i_normalisedP)**2).sum(). The eval will be extremely slow.
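Beyond dropping eval, the double loop itself can be replaced by a single broadcasted numpy computation, using the identity that the sum of squared differences between two columns equals the sum of their squared entries minus twice their dot product. A sketch, assuming year1_normalized_p is the 252 x 800 numeric frame from the question:

import numpy as np
import pandas as pd

X = year1_normalized_p.to_numpy(dtype=float)    # shape (252, 800): rows x columns
sq_norms = (X ** 2).sum(axis=0)                 # per-column sum of squares, shape (800,)

# Pairwise sum of squared differences between all columns at once:
# D[i, j] = sq_norms[i] + sq_norms[j] - 2 * (X.T @ X)[i, j]
D = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X.T @ X)

sqdiff_df = pd.DataFrame(D, index=year1_normalized_p.columns,
                         columns=year1_normalized_p.columns)

This does the same 800 x 800 computation as the nested loops but in a handful of array operations, which is the same kind of shortcut .cov() takes internally.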
I have a csv dataset with texts. I need to search through them. I couldn't find an easy way to search for a string in a dataset and get the row and column indexes. For example, let's say the dataset is like:
df = pd.DataFrame({"China": ['Xi','Lee','Hung'], "India": ['Roy','Rani','Jay'], "England": ['Tom','Sam','Jack']})
Now let's say I want to find the string 'rani' and know its location. Is there a simple function to do that? Or do I have to loop through everything to find it?
One vectorized (and therefore relatively scalable) solution to this is to leverage numpy.where:
import numpy as np
np.where(df == 'Rani')
This returns two arrays, corresponding to row and column indices:
(array([1]), array([1]))
You can continue to take advantage of vectorized operations, but also write a more complicated filtering function, like so:
np.where(df.applymap(lambda x: "ani" in x))
In other words, "apply to each cell the function that returns True if 'ani' is in the cell", and then conduct the same np.where filtering step.
You can use any function:
def _should_include_cell(cell_contents):
    return cell_contents.lower() == "rani" or "Xi" in cell_contents

np.where(df.applymap(_should_include_cell))
Some final notes:
applymap is slower than simple equality checking
if you need this to scale WAY up, consider using dask instead of pandas
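If you need the row and column labels rather than positions, the integer indices returned by np.where map straight back through df.index and df.columns. A short self-contained example using the dataframe from the question:

import numpy as np
import pandas as pd

df = pd.DataFrame({"China": ['Xi', 'Lee', 'Hung'],
                   "India": ['Roy', 'Rani', 'Jay'],
                   "England": ['Tom', 'Sam', 'Jack']})

rows, cols = np.where(df == 'Rani')
print(list(zip(df.index[rows], df.columns[cols])))   # [(1, 'India')]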
Not sure how this will scale, but it works:
df[df.eq('Rani')].dropna(axis=1, how='all').dropna()
India
1 Rani
I wrote a function in which each cell of a DataFrame is divided by a number saved in another dataframe.
def calculate_dfA(df_t, xout):
    df_A = df_t.copy()
    vector_x = xout.T
    for index_col, column in tqdm(df_A.iteritems()):
        for index_row, row in df_A.iterrows():
            df_A.iloc[index_row, index_col] = df_A.iloc[index_row, index_col] / vector_x.iloc[0, index_col]
    return df_A
The DataFrame on which I apply the calculation has a size of 14839 rows x 14839 columns. According to tqdm, the processing speed is roughly 4.5 s/it. Accordingly, the calculation would require approximately 50 days, which is not feasible for me. Is there a way to speed up my calculation?
You need to vectorize your division:
result = df_A.values / vector_x.values
This will broadcast along the row dimension and divide along the column dimension, as you seem to ask for.
Compared to your double for-loop, you are taking advantage of contiguity and homogeneity of the data in memory. This allows for a massive speedup.
Edit: Coming back to this answer today, I noticed that converting to a numpy array first speeds up the computation. Locally I get a 10x speedup for an array of a size similar to the one in the question above. I have edited my answer accordingly.
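For reference, a minimal sketch of the vectorized division wrapped back into a DataFrame, assuming xout holds one divisor per column of df_t (as in the question's vector_x = xout.T):

import numpy as np
import pandas as pd

values = df_t.to_numpy(dtype=float)                     # (n, n) matrix of values
scales = np.asarray(xout, dtype=float).reshape(1, -1)   # (1, n) row vector of divisors

# The (1, n) row vector broadcasts over every row of `values`, dividing column-wise.
df_A = pd.DataFrame(values / scales, index=df_t.index, columns=df_t.columns)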
I'm on mobile now, but you should try to avoid every for loop in Python - there's always a better way.
For one, I know you can multiply a pandas column (Series) by another column to get your desired result.
I think that to multiply every column by the matching column of another DataFrame you would still need to iterate (but only with one for loop => performance boost).
I would strongly recommend that you temporarily convert to a numpy ndarray and work with that.
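Concretely, pandas can also do the whole thing in a single call with DataFrame.div, with no loop at all. A small sketch under the same assumption that xout.T holds one divisor per column:

# Divide every column of df_t by the matching entry of vector_x, matched positionally.
vector_x = xout.T
df_A = df_t.div(vector_x.iloc[0].to_numpy(), axis=1)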
I have a dataframe that has 2 columns of zip codes, and I would like to add another column with their distance values. I am able to do this with a fairly low number of rows, but I am now working with a dataframe that has about 500,000 rows to calculate. The code I have works, but on my current dataframe it has been running for about 30 minutes with no completion, so I feel what I'm doing is extremely inefficient.
Here is the code
import pgeocode
dist = pgeocode.GeoDistance('us')
def distance_pairing(start, end):
    return dist.query_postal_code(start, end)
zips['distance'] = zips.apply(lambda x: distance_pairing(x['zipstart'], x['zipend']), axis=1)
zips
I know looping is out of the question, so is there something else I can do, efficiency-wise, that would make this better?
Whenever possible, use vectorized operations in pandas and numpy. In this case:
zips['distance'] = dist.query_postal_code(
    zips['zipstart'].values,
    zips['zipend'].values,
)
This won't always work, but in this case, the underlying pgeocode.haversine function is written (in numpy) to accommodate arrays of x and y coordinates. This should speed up your code by several orders of magnitude for a dataframe of this size.
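A short end-to-end sketch with made-up zip codes (pgeocode downloads its US postal data on first use), just to show the shape of the vectorized call:

import pandas as pd
import pgeocode

zips = pd.DataFrame({'zipstart': ['90210', '10001', '60614'],
                     'zipend':   ['94103', '02139', '73301']})

dist = pgeocode.GeoDistance('us')
# One call over the whole columns; distances come back in kilometers.
zips['distance'] = dist.query_postal_code(zips['zipstart'].values, zips['zipend'].values)
print(zips)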