Looking for faster loops for big dataframes - python

I have a very simple loop that just takes too long to iterate over my big dataframe.
value_needed = df.at[n, 'column_A']
for y in range(0, len(df)):
    index = df[df['column_B'].ge(value_needed)].index[y]
    if index > n:
        break
With this, I'm trying to find the first index that has a value greater than value_needed. The problem is that this loop is just too inefficient to run when len(df) > 200000.
Any ideas on how to solve this issue?

In general you should try to avoid loops with pandas, here is a vectorized way to get what you want:
df.loc[(df['column_B'].ge(value_needed)) & (df.index > n)].index[0]
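For example, on a small made-up frame (the data and n below are assumptions chosen to mirror the question):
import pandas as pd
# Hypothetical sample data; column names follow the question.
df = pd.DataFrame({'column_A': [5, 1, 7, 2, 9],
                   'column_B': [3, 8, 2, 10, 6]})
n = 1
value_needed = df.at[n, 'column_A']  # 1
# First index after n where column_B >= value_needed
first_index = df.loc[df['column_B'].ge(value_needed) & (df.index > n)].index[0]
print(first_index)  # 2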

I wish you had provided sample data. Try this on your data and let me know what you get:
import numpy as np
index = np.where(df['column_B'] > value_needed)[0].flat[0]
Then:
# continue with other logic
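A minimal sketch of how to also respect the n constraint in the same call (this assumes df has a default RangeIndex, so positions and labels coincide):
import numpy as np
# Combine both conditions; candidates holds the matching positions in order.
candidates = np.where((df['column_B'].to_numpy() > value_needed) & (np.arange(len(df)) > n))[0]
index = candidates[0] if candidates.size else None  # None if nothing qualifies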

Related

Comparing values in Dataframes

I am doing a Python project and trying to cut down on some computational time at the start using Pandas.
The code currently is:
for c1 in centres1:
    for c2 in centres2:
        if ((c1[0]-c2[0])**2 + (c1[1]-c2[1])**2) < search_rad*search_rad:
            possible_commet.append([c1, c2])
I am trying to put centres1 and centres2 into dataframes and then compare each value to every other value. Would pandas help me cut some time off this (currently about 2 minutes)? If not, how could I work around it?
Thanks
Unfortunately this is never going to be fast, as you are performing n-squared operations. For example, if you are comparing n objects where n = 1000, you only have 1 million comparisons. If, however, n = 10_000, you have 100 million comparisons. A problem 10x bigger becomes 100 times slower.
Nevertheless, for loops in Python are relatively expensive. Using a library like pandas may mean you only need to make one function call, which will shave some time off. Without any input data it's hard to assist further, but the below should provide some building blocks:
import pandas
df1 = pandas.DataFrame(centres1)
df2 = pandas.DataFrame(centres2)
df3 = df1.merge(df2, how='cross')
df3['combined_centre'] = (df3['0_x'] - df3['0_y'])**2 + (df3['1_x'] - df3['1_y'])**2
df3[df3['combined_centre'] < search_rad**2]
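If it helps, here is roughly how those building blocks behave on a couple of made-up centre lists (the coordinates and search_rad below are assumptions):
import pandas as pd
# Hypothetical input: lists of (x, y) centre coordinates.
centres1 = [(0, 0), (5, 5)]
centres2 = [(1, 1), (50, 50)]
search_rad = 3
df1 = pd.DataFrame(centres1)
df2 = pd.DataFrame(centres2)
pairs = df1.merge(df2, how='cross')  # every c1 paired with every c2
# Overlapping integer column names pick up the _x/_y suffixes used above.
pairs['combined_centre'] = (pairs['0_x'] - pairs['0_y'])**2 + (pairs['1_x'] - pairs['1_y'])**2
print(pairs[pairs['combined_centre'] < search_rad**2])  # keeps only the (0, 0) vs (1, 1) pair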
Yes, pandas will certainly help cut off at least some of the time you are seeing right now, but you could also try this:
for c1, c2 in zip(centres1, centres2):
    if ((c1[0]-c2[0])**2 + (c1[1]-c2[1])**2) < search_rad*search_rad:
        possible_commet.append([c1, c2])

deleting loops to increase efficiency in python

How do I make this more efficient? I feel like I should be able to do this without looping through the entire dataframe. Basically, I have to split the column CollectType into multiple columns depending on the value in column SSampleCode.
for i in range(0, len(df)):
    if df.SSampleCode[i]=='Rock':
        df.R_SampleType[i]=df.CollectType[i]
    elif df.SSampleCode[i]=='Soil':
        df.S_SampleType[i]=df.CollectType[i]
    elif df.SSampleCode[i]=='Pan Con':
        df.PC_SampleType[i]=df.CollectType[i]
    elif df.SSampleCode[i]=='Silt':
        df.SS_SampleType[i]=df.CollectType[i]
This can be done using masks (a vectorized approach):
for i in range(0, len(df)):
    if df.SSampleCode[i]=='Rock':
        df.R_SampleType[i]=df.CollectType[i]
becomes
mask = df.SSampleCode == 'Rock'
df.loc[mask, 'R_SampleType'] = df.loc[mask, 'CollectType']
This will give you a significant performance improvement.
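Extending the same idea to all four sample codes might look like this (a sketch; the column names are taken from the question and df is the question's dataframe):
# Map each SSampleCode value to the column it should populate.
targets = {'Rock': 'R_SampleType',
           'Soil': 'S_SampleType',
           'Pan Con': 'PC_SampleType',
           'Silt': 'SS_SampleType'}
for code, col in targets.items():
    mask = df['SSampleCode'] == code
    df.loc[mask, col] = df.loc[mask, 'CollectType']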

Replacing Values in Dataframe Based on Condition

I am using a DataFrame in Python that has a column of percentages. I would like to replace values greater than 50% with 'Likely' and those less than 50% with 'Not-Likely'.
Here are the options I found:
df.apply
df.iterrows
df.where
This works for the df.iterrows:
for index, row in df.iterrows():
    if row['Chance'] > 0.50:
        df.loc[index, 'Chance'] = 'Likely'
    else:
        df.loc[index, 'Chance'] = 'Not-Likely'
However, I have read that this is not an optimal way of 'updating' values.
How would you do this using the other methods and which one would you recommend? Also, if you know any other methods, please share! Thanks
Give this a shot.
import numpy as np
df['Chance'] = np.where(df['Chance'] > 0.50, 'Likely', 'Not-Likely')
Note, however, that this will label anything exactly equal to 0.50 as 'Not-Likely'.
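If more than two labels are ever needed, np.select generalizes the same idea; here is a sketch (the 0.75 threshold and the 'Prediction' column name are made up for illustration):
import numpy as np
conditions = [df['Chance'] >= 0.75, df['Chance'] >= 0.50]
choices = ['Very likely', 'Likely']
# np.select takes the first matching condition; writing to a new column keeps the original probabilities intact.
df['Prediction'] = np.select(conditions, choices, default='Not-Likely')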
Just as a side note, .itertuples() is said to be about 10x faster than .iterrows(), and zip about 100x faster.

Search for elements by timestamp in a sorted pandas dataframe

I have a very large pandas dataframe/series with millions of elements.
I need to find all the elements whose timestamp is < t0.
So normally what I would do is:
selected_df = df[df.index < t0]
This takes ages. As I understand it, when pandas searches it goes through every element of the dataframe. However, I know that my dataframe is sorted, so I could break out of a loop as soon as the timestamp exceeds t0. I assume pandas doesn't know the dataframe is sorted and searches through all timestamps.
I have tried to use pandas.Series instead - still very slow.
I have tried to write my own loop like:
boudery = 0
ticks_time_list = df.index
tsearch = ticks_time_list[0]
while tsearch < t0:
    tsearch = ticks_time_list[boudery]
    boudery += 1
selected_df = df[:boudery]
This takes even longer than the pandas search.
The only solution I can see at the moment is to use Cython.
Any ideas how this can be sorted out without involving C?
It doesn't really seem to take ages for me, even with a long frame:
>>> df = pd.DataFrame({"A": 2, "B": 3}, index=pd.date_range("2001-01-01", freq="1 min", periods=10**7))
>>> len(df)
10000000
>>> %timeit df[df.index < "2001-09-01"]
100 loops, best of 3: 18.5 ms per loop
But if we're really trying to squeeze out every drop of performance, we can use the searchsorted method after dropping down to numpy:
>>> %timeit df.iloc[:df.index.values.searchsorted(np.datetime64("2001-09-01"))]
10000 loops, best of 3: 51.9 µs per loop
>>> df[df.index < "2001-09-01"].equals(df.iloc[:df.index.values.searchsorted(np.datetime64("2001-09-01"))])
True
which is many times faster.
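As a side note (continuing the session above), the same binary search is also available on the index itself, which avoids the drop to .values; this is a minor variation rather than a further speed-up:
# Equivalent using the Index method directly.
cutoff = df.index.searchsorted(pd.Timestamp("2001-09-01"))
selected_df = df.iloc[:cutoff]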
(I'm not very familiar with Pandas, but this describes a very generic idea - you should be able to apply it. If necessary, adapt the Pandas-specific functions.)
You could try to use a more efficient search. At the moment you are using a linear search, going through all the elements. Instead, try this:
ticks_time_list = df.index
tsearch_min = 0
tsearch_max = len(ticks_time_list)  # I'm not sure whether this works on a pandas index
while tsearch_min < tsearch_max:
    tsearch_middle = (tsearch_min + tsearch_max) // 2
    if ticks_time_list[tsearch_middle] < t0:
        tsearch_min = tsearch_middle + 1
    else:
        tsearch_max = tsearch_middle
# tsearch_min == tsearch_max: the first position whose timestamp is >= t0
Instead of looking at every single element's timestamp, it finds the "boundary" by repeatedly cutting the search space in half.
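The standard library's bisect module implements the same lower-bound search, if you'd rather not hand-roll the loop (this assumes t0 is comparable with the index elements, e.g. a pd.Timestamp):
import bisect
# bisect_left returns the first position whose value is >= t0,
# i.e. exactly the boundary the loop above converges to.
boundary = bisect.bisect_left(df.index, t0)
selected_df = df.iloc[:boundary]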

Apply formula to specific numpy array values

I have a 2-dimensional array in numpy and need to apply a mathematical formula just to some values of the array which match certain criteria. This could be done using a for loop and if conditions; however, I think using numpy's where() method works faster.
My code so far is this, but it doesn't work:
cond2 = np.where((SPN >= -alpha) & (SPN <= 0))
SPN[cond2] = -1*math.cos((SPN[cond2]*math.pi)/(2*alpha))
The values in the original array need to be replaced with the corresponding values after applying the formula.
Any ideas on how to make this work? I'm working with big arrays, so I need an efficient way of doing it.
Thanks
Try this:
cond2 = (SPN >= -alpha) & (SPN <= 0)
SPN[cond2] = -np.cos(SPN[cond2]*np.pi/(2*alpha))
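For instance, with a made-up array and alpha (the values below are just for illustration):
import numpy as np
alpha = 2.0
SPN = np.array([[-3.0, -1.0], [0.0, 1.5]])
cond2 = (SPN >= -alpha) & (SPN <= 0)   # elementwise boolean mask on the 2-D array
SPN[cond2] = -np.cos(SPN[cond2] * np.pi / (2 * alpha))
print(SPN)  # approximately [[-3. -0.7071], [-1. 1.5]]; only the masked entries change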
