Optimizing a nested for loop with respect to time taken - Python

I have a dictionary of dictionaries, let's say 'data', and a numpy array, let's say 'stats'.
I am trying to check whether:
the first and second columns of the numpy array fall within a range defined by two keys of each entry in my dictionary of dictionaries, OR whether those two keys fall within a range defined by the columns of my numpy array.
I'm providing my code below for reference.
The main issue is that this takes a lot of time; I would really appreciate any help on making it run faster.
Thank you.
final = []
for x, y, w, h, area in stats[:]:
    valid = True
    if any([(x in range(s["hpos_start"] - 2, s["hpos_end"] + 2)
             and y in range(s["vpos_start"] - 2, s["vpos_end"] + 2))
            or (int(s['hpos_start']) in range(x, x + w)
                and int(s['vpos_start']) in range(y, y + h))
            for _, s in data.items()]):
        valid = False
    if valid:
        final.append([x, y, w, h])
sample for stats =
[[ 246 1102 1678 2214 172182],
[ 678 1005 1688 2214 3528850],
[ 1031 241 17 23 331]]
sample for data =
{'0': {'hpos_start': 244,
       'hpos_end': 296,
       'vpos_start': 1099,
       'vpos_end': 3898},
 '1': {'hpos_start': 679,
       'hpos_end': 952,
       'vpos_start': 231,
       'vpos_end': 281},
 '2': {'hpos_start': 1077,
       'hpos_end': 1174,
       'vpos_start': 231,
       'vpos_end': 281}}
stats has a shape of about (352, 5) and data has about 212 entries; both can be larger than that.

My suggestion would be to turn both stats and data into numpy arrays and then figure out a way to achieve your particular filtering without explicit for-loops. That's the whole advantage of numpy! You would then use boolean or fancy indexing to generate your final array instead of building it piece by piece; appending to a list can be somewhat slow.
For a small but easy-to-implement speedup: When you use any or all in your code, you should avoid passing it a list when you can pass it a generator expression instead. If you just remove the square brackets from inside the any, you should see a little speedup, because you will avoid always building the full intermediate list! The cool thing about any (and all) is that, when working on iterators, they have what's called short-circuiting: As soon as any finds an item that's True, it knows it can just stop looking at the rest of the items because the answer will be true. Likewise as soon as all finds an item that's False it will stop looking at the rest and just return False.
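For illustration, this is the exact same condition from your code, just without the square brackets, so any receives a generator expression and can bail out at the first matching entry:

if any(
    (x in range(s["hpos_start"] - 2, s["hpos_end"] + 2)
     and y in range(s["vpos_start"] - 2, s["vpos_end"] + 2))
    or (int(s["hpos_start"]) in range(x, x + w)
        and int(s["vpos_start"]) in range(y, y + h))
    for _, s in data.items()
):
    valid = False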
But really, turn your inputs into numpy arrays (or maybe a numpy array and a pandas dataframe) and then try to figure out a way to avoid for-loops.
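To make that concrete, here is a rough, untested sketch of a fully vectorized version of your filter (the +/-2 margins and the exact end points of your range() calls may need a one-off adjustment, so treat it as a starting point rather than a drop-in replacement):

import numpy as np

# One row per region in `data`: [hpos_start, hpos_end, vpos_start, vpos_end]
regions = np.array([[int(s["hpos_start"]), int(s["hpos_end"]),
                     int(s["vpos_start"]), int(s["vpos_end"])]
                    for s in data.values()])

x, y, w, h = stats[:, 0:1], stats[:, 1:2], stats[:, 2:3], stats[:, 3:4]      # (N, 1) each
hs, he, vs, ve = regions[:, 0], regions[:, 1], regions[:, 2], regions[:, 3]  # (M,) each

# Broadcasting gives an (N, M) boolean matrix: stat i vs. region j
inside_region = (hs - 2 <= x) & (x <= he + 2) & (vs - 2 <= y) & (y <= ve + 2)
region_inside = (x <= hs) & (hs <= x + w) & (y <= vs) & (vs <= y + h)

invalid = (inside_region | region_inside).any(axis=1)   # (N,)
final = stats[~invalid, :4]                             # array of the kept [x, y, w, h] rows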

To test and time your code, I ran it in a for loop 100000 times. Your code runs in 1.406 seconds. I propose the code below: it replaces the "in range" tests with limit tests and drops the int(...) casts, and it runs in 0.609 seconds on my PC. Try it on yours and see what speed you can gain:
import time

start = time.process_time()
for i in range(100000):
    ## 100000x 0.609
    final = []
    # unpack in the same column order as the question: x, y, w, h, area
    for x, y, w, h, area in stats:
        if any([
            ((s["hpos_start"] - 2) <= x <= (s["hpos_end"] + 2)
             and (s["vpos_start"] - 2) <= y <= (s["vpos_end"] + 2))
            or
            (x <= s['hpos_start'] <= (x + w)
             and y <= s['vpos_start'] <= (y + h))
            for s in data.values()
        ]):
            pass
        else:
            final.append([x, y, w, h])
print(time.process_time() - start)

Related

Improve performance of combinations

Hey guys, I have a script that compares every possible pair of users and checks how similar their text is:
dictionary = {
    t.id: (
        t.text,
        t.set,
        t.compare_string
    )
    for t in dataframe.itertuples()
}

highly_similar = []

for a, b in itertools.combinations(dictionary.items(), 2):
    if a[1][2] == b[1][2] and not a[1][1].isdisjoint(b[1][1]):
        similarity_score = fuzz.ratio(a[1][0], b[1][0])
        if (similarity_score >= 95 and len(a[1][0]) >= 10) or similarity_score == 100:
            highly_similar.append([a[0], b[0], a[1][0], b[1][0], similarity_score])
This script takes around 15 minutes to run. The dataframe contains 120k users, so comparing every possible combination takes quite a bit of time; if I just write pass in the loop body, it takes 2 minutes to iterate over all the pairs.
I tried using filter() and map() for the if statements and the fuzzy score, but the performance was worse. I have improved the script as much as I could, but I don't know how to take it further.
Would really appreciate some help!
It is slightly complicated to reason about the data since you have not attached it, but we can see multiple places that might provide an improvement:
First, let's rewrite the code in a way which is easier to reason about than using the indices:
dictionary = {
    t.id: (
        t.text,
        t.set,
        t.compare_string
    )
    for t in dataframe.itertuples()
}

highly_similar = []

for a, b in itertools.combinations(dictionary.items(), 2):
    a_id, (a_text, a_set, a_compare_string) = a
    b_id, (b_text, b_set, b_compare_string) = b
    if (a_compare_string == b_compare_string
            and not a_set.isdisjoint(b_set)):
        similarity_score = fuzz.ratio(a_text, b_text)
        if ((similarity_score >= 95 and len(a_text) >= 10)
                or similarity_score == 100):
            highly_similar.append(
                [a_id, b_id, a_text, b_text, similarity_score])
You seem to only care about pairs having the same compare_string value. Therefore, and assuming this is not something that all pairs share, we can group by that value so that we cover far fewer pairs.
To put some numbers on it, let's say you have 120K inputs and 1K items per compare_string value. Then instead of covering roughly 120K * 120K = 14 * 10^9 combinations, you would have 120 bins of size 1K (where in each bin we'd need to check all pairs), i.e. 120 * 1K * 1K = 120 * 10^6, which is about 100 times fewer. It would be even faster if each bin has fewer than 1K elements.
import collections

# Create a dictionary from compare_string to all items
# with the same compare_string
items_by_compare_string = collections.defaultdict(list)
for item in dictionary.items():
    compare_string = item[1][2]
    items_by_compare_string[compare_string].append(item)

# Iterate over each group of items that have the same
# compare string
for item_group in items_by_compare_string.values():
    # Check pairs only within that group
    for a, b in itertools.combinations(item_group, 2):
        a_id, (a_text, a_set, _) = a
        b_id, (b_text, b_set, _) = b
        # No need to compare the compare_strings!
        if not a_set.isdisjoint(b_set):
            similarity_score = fuzz.ratio(a_text, b_text)
            if ((similarity_score >= 95 and len(a_text) >= 10)
                    or similarity_score == 100):
                highly_similar.append(
                    [a_id, b_id, a_text, b_text, similarity_score])
But, what if we want more speed? Let's look at the remaining operations:
We have a check to find if two sets share at least one item
This seems like an obvious candidate for optimization if we have any knowledge about these sets (to allow us to determine which pairs are even relevant to compare)
Without additional knowledge, and just looking at every pair and trying to speed this up, I doubt we can do much - this is probably highly optimized using internal details of Python sets, so I don't think it's likely we can optimize it further
We have a fuzz.ratio computation, which is an external function that I'm going to assume is heavy
If you are using this from the FuzzyWuzzy package, make sure to install python-Levenshtein to get the speedups detailed here
We have some comparisons which we are unlikely to be able to speed up
We might be able to cache the length of a_text by nesting the two loops, but that's negligible
We have appends to a list, which runs on average ("amortized") constant time per operation, so we can't really speed that up
Therefore, I don't think we can reasonably suggest any more speedups without additional knowledge. If we know something about the sets that can help optimize which pairs are relevant we might be able to speed things up further, but I think this is about it.
EDIT: As pointed out in other answers, you can obviously run the code in multi-threading. I assumed you were looking for an algorithmic change that would possibly reduce the number of operations significantly, instead of just splitting these over more CPUs.
Essentially, from the Python programming side, I see two things that can improve your processing time:
multi-threading and vectorized operations.
From the fuzzy-score side, here is a list of tips you can use to improve your processing time (open it in a new anonymous tab to avoid the paywall):
https://towardsdatascience.com/fuzzy-matching-at-scale-84f2bfd0c536
Using multiple threads you can speed your operation up by up to N times, where N is the number of threads your CPU supports. You can check that with:
import multiprocessing
multiprocessing.cpu_count()
Using vectorized operations you can process your data in parallel at a low level with SIMD (single instruction / multiple data) operations, or with GPU tensor operations (like those in TensorFlow/PyTorch).
Here is a small comparison of results for each case:
import numpy as np
import time

A = [np.random.rand(512) for i in range(2000)]
B = [np.random.rand(512) for i in range(2000)]
high_similarity = []

def measure(i, j, a, b, high_similarity):
    d = ((a - b) ** 2).sum()
    if d > 12:
        high_similarity.append((i, j, d))

start_single_thread = time.time()
for i in range(len(A)):
    for j in range(len(B)):
        if i < j:
            measure(i, j, A[i], B[j], high_similarity)
finish_single_thread = time.time()
print("single thread time:", finish_single_thread - start_single_thread)
out[0] single thread time: 147.64517450332642
Running on multiple threads:

from threading import Thread

high_similarity = []

def measure(a=None, b=None, high_similarity=None):
    d = ((a - b) ** 2).sum()
    if d > 12:
        high_similarity.append(d)

start_multi_thread = time.time()
for i in range(len(A)):
    for j in range(len(B)):
        if i < j:
            thread = Thread(target=measure,
                            kwargs={'a': A[i], 'b': B[j], 'high_similarity': high_similarity})
            thread.start()
            thread.join()
finish_multi_thread = time.time()
print("time to run on multi threads:", finish_multi_thread - start_multi_thread)
out[1] time to run on multi-threads: 11.946279764175415
A_array = np.array(A)
B_array = np.array(B)

start_vectorized = time.time()
for i in range(len(A_array)):
    # vectorized distance operation: one squared distance per row pair
    dists = ((A_array - B_array) ** 2).sum(axis=1)
    high_similarity += dists[dists > 12].tolist()
    # rotate B_array by one row so that every A row eventually meets every B row
    aux = B_array[-1]
    B_array = np.delete(B_array, -1, axis=0)
    B_array = np.insert(B_array, 0, aux, axis=0)
finish_vectorized = time.time()
print("time to run vectorized operations:", finish_vectorized - start_vectorized)
out[2] time to run vectorized operations: 2.302949905395508
Note that you can't guarantee any order of execution, so you will also need to store the index of each result. The snippets above are only meant to illustrate that you can use parallel processing; I highly recommend using a pool of workers instead, dividing your dataset into N subsets (one per worker) and joining the final results, rather than creating a thread per function call as I did. A rough sketch of that idea follows.
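As a minimal sketch of that idea using concurrent.futures (the chunking scheme, worker count, and threshold are assumptions layered on top of the toy example above, not tested code):

from concurrent.futures import ProcessPoolExecutor
import numpy as np

def measure_chunk(args):
    # each worker handles the pairs (i, j) with i in [start, stop) and j > i
    start, stop, A, B = args
    results = []
    for i in range(start, stop):
        for j in range(i + 1, len(B)):
            d = ((A[i] - B[j]) ** 2).sum()
            if d > 12:
                results.append((i, j, d))
    return results

if __name__ == "__main__":
    A = np.random.rand(2000, 512)
    B = np.random.rand(2000, 512)
    n_workers = 4
    bounds = np.linspace(0, len(A), n_workers + 1, dtype=int)
    chunks = [(bounds[k], bounds[k + 1], A, B) for k in range(n_workers)]
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        high_similarity = [r for part in pool.map(measure_chunk, chunks) for r in part]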

Improving loop in loops with Numpy

I am using numpy arrays instead of pandas for speed. However, I haven't been able to improve my code with broadcasting, indexing, etc.; instead I am using loops within loops, as below. It works, but it seems ugly and inefficient to me.
Basically, I am trying to imitate pandas' groupby at the step mydata[mydata[:,1]==i]; you can think of column 1 as a firm id number. Then, for each entry of the lookup data, I check whether it is fully contained in the selected firm at the step all(np.isin(lookup[u],d[:,3])). But as I noted at the beginning, I'm not comfortable with this.
out = []
for i in np.unique(mydata[:, 1]):
    d = mydata[mydata[:, 1] == i]
    for u in range(0, len(lookup)):
        control = all(np.isin(lookup[u], d[:, 3]))
        if control:
            out.append(d[np.isin(d[:, 3], lookup[u])])
It takes about 0.27 seconds, but there must be a cleverer alternative.
I also tried Numba jit() but it does not work.
Could anyone help me about that?
Thanks in advance!
Fake Data:
a = np.repeat(np.arange(100) + 5000, np.random.randint(50, 100, 100))
b = np.random.randint(100, 200, len(a))
c = np.random.randint(10, 70, len(a))
index = np.arange(len(a))
mydata = np.vstack((index, a, b, c)).T

lookup = []
for i in range(0, 60):
    lookup.append(np.random.randint(10, 70, np.random.randint(3, 6, 1)))
I had some trouble understanding the goal of your program, but I got a decent performance improvement by refactoring your second for loop. I was able to compress your code to three or four lines.
f = (
    lambda lookup: out.append(d[np.isin(d[:, 3], lookup)])
    if all(np.isin(lookup, d[:, 3]))
    else None
)

out = []
for i in np.unique(mydata[:, 1]):
    d = mydata[mydata[:, 1] == i]
    list(map(f, lookup))
This produces the same output list you got previously, and the code runs almost twice as quickly (at least on my machine).
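For comparison, the same refactor can also be written without map and lambda, which some may find easier to read; this is just a sketch using the names from the fake data above:

out = []
for i in np.unique(mydata[:, 1]):
    d = mydata[mydata[:, 1] == i]
    present = set(d[:, 3])
    # keep the rows matching a lookup group only when the whole group is present
    out.extend(d[np.isin(d[:, 3], lu)] for lu in lookup if set(lu) <= present)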

I need to make my program's nested loops simpler, since the running time is far too long

I am still learning about nested loops in Python.
Problem:
Below I have written my code. I want to make it simpler, since it takes a very long time to produce a result when I run it.
My code:
I have a list which contains 1000 values:
Brake_index_values = [ 44990678, 44990679, 44990680, 44990681, 44990682, 44990683,
44997076, 44990684, 44997077, 44990685,
...
44960673, 8195083, 8979525, 100107546, 11089058, 43040161,
43059162, 100100533, 10180192, 10036189]
I store the first element in another list:
original_top_brake_index = [Brake_index_values[0]]
I created a temporary list called temp and a numpy array to iterate over in the loop:
temp =[]
arr = np.arange(0,1000,1)
Loop operation:
for i in range(1, len(Brake_index_values)):
    if top_15_brake <= 15:
        a1 = Brake_index_values[i]
        # a2 = Brake_index_values[j]
        a3 = arr[:i]
        for j in a3:
            a2 = range(Brake_index_values[j] - 30000, Brake_index_values[j] + 30000)
            if a1 in a2:
                pass
            else:
                temp.append(a1)
        if len(temp) == len(a3):
            original_top_brake_index.append(a1)
            top_15_brake += 1
            del temp[:]
        else:
            del temp[:]
            continue
What I did in the code:
I check whether Brake_index_values[1] lies within 30000 before or after Brake_index_values[0], i.e. within range(Brake_index_values[0]-30000, Brake_index_values[0]+30000).
If Brake_index_values[1] lies within that range, I ignore it and move on to the next element, Brake_index_values[2], repeating the same check against both Brake_index_values[0] and Brake_index_values[1].
If an element lies outside the ranges of all previous elements, I store it in original_top_brake_index via an append operation.
The result I get:
It works, but it takes a very long time to complete and sometimes raises a MemoryError.
Requirement:
I just want my code to be simpler and more efficient, using simple operations.
Request:
I am not a good coder, but I am sure there is an easier way to do the above. Kindly shed some light on how to avoid this problem, or suggest a new approach.
You can have a look at numpy.where
(https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.where.html)
to tackle this problem. Your code would then look like:
BIV = np.array(Brake_index_values)  # shortening for convenience
ref_val = BIV[0]
req_indices, = np.where((BIV < ref_val - 3e4) | (BIV > ref_val + 3e4))
req_array = BIV[req_indices]
This should give you an array of all the values passing the condition which you can further use.
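If you need the full selection loop from the question rather than a single pass, the same boolean mask can replace the inner loop entirely. This is only a sketch of that idea, reusing the question's names (the exact range() end points may differ by one):

BIV = np.array(Brake_index_values)
original_top_brake_index = [BIV[0]]
top_15_brake = 0
for i in range(1, len(BIV)):
    if top_15_brake > 15:
        break
    a1 = BIV[i]
    # keep a1 only if it lies outside +/-30000 of every earlier value
    if np.all(np.abs(BIV[:i] - a1) >= 30000):
        original_top_brake_index.append(a1)
        top_15_brake += 1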

Speed up parameter testing using Dask

I have a time series dataframe with about 10 columns, on which I perform manipulations to produce strategy results. I would like to test 2 parameters, as they may or may not affect each other. When tested independently, each run takes over 10 sec per unit (over 6.5 hours for the total run), and I'm looking to speed this up. I have been reading about Dask and it seems like the right module to use.
My current code iterates over each parameter range with nested loops. I know it can be parallelized, as the data per day is mutually exclusive.
Here is the code:
amount1 = np.arange(.001, .03, .0005)
amount2 = np.arange(.001, .03, .0005)

def getResults(df, amount1, amount2):
    final_results = []
    for x in tqdm(amount1):
        for y in amount2:
            df1 = None
            df1 = function1(df.copy(), x, y)  # takes about 2 sec
            df1 = function2(df1)              # takes about 2 sec
            df1 = function3(df1)              # takes about 3 sec
            final_results.append([x, y, df1['results'].iloc[-1]])
    return final_results
UPDATE:
So it looks like the improvement should come from adjusting the function to remove the iteration from the calls and to create a list of jobs (my understanding). Here is where I am so far. I will probably need to move my df to a Dask dataframe so that the data can be chunked into smaller pieces. The question is: do I leave function1, function2 and function3 as pandas vector manipulations, or do they need to become full Dask functions?
def getResults(df, amount):
    df1 = None
    df1 = dsk.delayed(function1)(df, amount[0], amount[1])
    df1 = dsk.delayed(function2)(df1)
    df1 = dsk.delayed(function3)(df1)
    return [amount[0], amount[1], df1['results'].iloc[-1]]

# Create a list of processes from jobs. jobs is a list of tuples
# that replaces the iteration.
processes = [getResults(df, items) for items in jobs]

# Create a list of results from the processes
results = []
for i in range(len(processes)):
    results.append(processes[i])
You probably want to use either dask.delayed or the concurrent.futures interface.
Something like the following would probably work well (untested, I recommend that you read the docs referenced above to understand what it's doing).
def getResults(df, amount1, amount2):
    final_results = []
    for x in amount1:
        for y in amount2:
            df1 = None
            df1 = dask.delayed(function1)(df.copy(), x, y)
            df1 = dask.delayed(function2)(df1)
            df1 = dask.delayed(function3)(df1)
            final_results.append([x, y, df1['results'].iloc[-1]])
    return final_results

out = getResults(df, amount1, amount2)
result = dask.delayed(out).compute()
Also, I would avoid calling df.copy() if you can; ideally function1 would not mutate its input data.
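Since the concurrent.futures interface was mentioned as an alternative, here is a rough, untested sketch of that route; it assumes df, amount1, amount2 and function1/2/3 are defined at module level and picklable:

from concurrent.futures import ProcessPoolExecutor
from itertools import product

def run_one(args):
    # same pipeline as above, one (x, y) pair per task
    df, x, y = args
    df1 = function3(function2(function1(df.copy(), x, y)))
    return [x, y, df1['results'].iloc[-1]]

if __name__ == "__main__":
    jobs = ((df, x, y) for x, y in product(amount1, amount2))
    with ProcessPoolExecutor() as pool:
        final_results = list(pool.map(run_one, jobs))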

python2 to 3 use of list()

I'm converting Python 2.7 scripts to Python 3.
2to3 makes these kinds of suggestions:
   result = result.split(',')
   syslog_trace("Result : {0}".format(result), False, DEBUG)
-  data.append(map(float, result))
+  data.append(list(map(float, result)))
   if (len(data) > samples):
     data.pop(0)
   syslog_trace("Data : {0}".format(data), False, DEBUG)
   # report sample average
   if (startTime % reportTime < sampleTime):
-    somma = map(sum, zip(*data))
+    somma = list(map(sum, list(zip(*data))))
     # not all entries should be float
     # 0.37, 0.18, 0.17, 4, 143, 32147, 3, 4, 93, 0, 0
     averages = [format(sm / len(data), '.3f') for sm in somma]
I'm sure the makers of Python3 did not want to do it like that. At least, it gives me a "you must be kidding" feeling.
Is there a more pythonic way of doing this?
What's wrong with the unfixed somma?
2to3 cannot know how somma is going to be used. In this case it is consumed as an iterator in the next line to compute averages, so it is fine and optimal as it is; there is no need to convert it to a list.
That's the point of Python 3's switch from lists to iterators here: most people consumed those lists lazily anyway, wasting precious memory by materializing lists they did not need.
# report sample average
if (startTime % reportTime < sampleTime):
    somma = map(sum, zip(*data))
    # not all entries should be float
    # 0.37, 0.18, 0.17, 4, 143, 32147, 3, 4, 93, 0, 0
    averages = [format(sm / len(data), '.3f') for sm in somma]
Of course the first statement, left unconverted, is wrong, since we would append a map iterator where we need a list. In that case the problem is quickly found and fixed.
If left like this: data.append(map(float, result)), the next trace shows something fishy: 'Data : [<map object at 0x00000000043DB6A0>]', which you can quickly fix by converting to a list, as 2to3 suggested.
2to3 does its best to create running Python 3 code, but it does not replace manual rework or produce optimal code. When you are in a hurry you can apply it, but always check the diffs against the old code, like the OP did.
The -3 option of the latest Python 2 versions prints warnings about usages that would be problematic under Python 3. It's another approach, better suited when you have more time to perform your migration.
I'm sure the makers of Python3 did not want to do it like that
Well, the makers of Python generally don't like seeing Python 2 being used; I've seen that sentiment expressed at pretty much every recent PyCon.
Is there a more pythonic way of doing this?
That really depends on your interpretation of Pythonic. List comprehensions seem more intuitive in your case: you want to construct a list, so there's no need to create an iterator with map or zip and then feed it to list().
Now, why 2to3 chose list() wrapping instead of comprehensions, I do not know; it's probably the easiest transformation to implement automatically.
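For instance, the converted snippet above could be written with comprehensions instead; a small sketch reusing the OP's variables (which are assumed to exist):

# instead of data.append(list(map(float, result)))
data.append([float(x) for x in result])

if len(data) > samples:
    data.pop(0)

# report sample average
if startTime % reportTime < sampleTime:
    # a generator is enough here, since somma is consumed exactly once below
    somma = (sum(column) for column in zip(*data))
    averages = [format(sm / len(data), '.3f') for sm in somma]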
