Improving nested loops with NumPy - Python

I am using numpy arrays instead of pandas for speed purposes. However, I am unable to improve my code using broadcasting, indexing, etc. Instead, I am using nested loops as below. It works, but it seems ugly and inefficient to me.
Basically, I am trying to imitate pandas' groupby at the step mydata[mydata[:,1]==i]; you can think of column 1 as a firm ID number. Then, for each entry of the lookup data, I check whether it is entirely contained in the selected firm at the step all(np.isin(lookup[u],d[:,3])). But as I noted at the beginning, I feel uncomfortable about this.
out = []
for i in np.unique(mydata[:,1]):
    d = mydata[mydata[:,1]==i]
    for u in range(0,len(lookup)):
        control = all(np.isin(lookup[u],d[:,3]))
        if control:
            out.append(d[np.isin(d[:,3],lookup[u])])
It takes about 0.27 seconds. However, there must be a cleverer alternative.
I also tried Numba's jit(), but it did not work.
Could anyone help me with this?
Thanks in advance!
Fake Data:
a = np.repeat(np.arange(100)+5000, np.random.randint(50, 100, 100))
b = np.random.randint(100,200,len(a))
c = np.random.randint(10,70,len(a))
index = np.arange(len(a))
mydata = np.vstack((index,a, b,c)).T
lookup = []
for i in range(0,60):
    lookup.append(np.random.randint(10,70,np.random.randint(3,6,1)))

I had some trouble working out the goal of your program, but I got a decent performance improvement by refactoring your second for loop. The code compresses to three or four lines.
f = (
    lambda lu: out.append(d[np.isin(d[:, 3], lu)])
    if all(np.isin(lu, d[:, 3]))
    else None
)

out = []
for i in np.unique(mydata[:, 1]):
    d = mydata[mydata[:, 1] == i]
    list(map(f, lookup))
This produces the same output list you received previously, and the code runs almost twice as fast (at least on my machine).
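If you prefer to avoid the side-effecting lambda, the same refactor can also be written as a comprehension (a minimal sketch, not the original answer's code; it assumes numpy is imported as np and reuses mydata and lookup from the fake data above):
out = []
for i in np.unique(mydata[:, 1]):
    d = mydata[mydata[:, 1] == i]
    # keep a slice of d for every lookup group fully contained in this firm's column 3
    out.extend(d[np.isin(d[:, 3], lu)] for lu in lookup if np.isin(lu, d[:, 3]).all())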

Improve performance of combinations

Hey guys, I have a script that compares every possible pair of users and checks how similar their text is:
dictionary = {
    t.id: (
        t.text,
        t.set,
        t.compare_string
    )
    for t in dataframe.itertuples()
}

highly_similar = []

for a, b in itertools.combinations(dictionary.items(), 2):
    if a[1][2] == b[1][2] and not a[1][1].isdisjoint(b[1][1]):
        similarity_score = fuzz.ratio(a[1][0], b[1][0])
        if (similarity_score >= 95 and len(a[1][0]) >= 10) or similarity_score == 100:
            highly_similar.append([a[0], b[0], a[1][0], b[1][0], similarity_score])
This script takes around 15 minutes to run. The dataframe contains 120k users, so comparing every possible combination takes quite a bit of time; if I just write pass in the for loop, it takes 2 minutes to loop through all values.
I tried using filter() and map() for the if statements and the fuzzy score, but the performance was worse. I tried improving the script as much as I could, but I don't know how I can improve it further.
Would really appreciate some help!
It is slightly complicated to reason about the data since you have not attached it, but we can see multiple places that might provide an improvement:
First, let's rewrite the code in a way which is easier to reason about than using the indices:
dictionary = {
    t.id: (
        t.text,
        t.set,
        t.compare_string
    )
    for t in dataframe.itertuples()
}

highly_similar = []

for a, b in itertools.combinations(dictionary.items(), 2):
    a_id, (a_text, a_set, a_compare_string) = a
    b_id, (b_text, b_set, b_compare_string) = b
    if (a_compare_string == b_compare_string
            and not a_set.isdisjoint(b_set)):
        similarity_score = fuzz.ratio(a_text, b_text)
        if ((similarity_score >= 95 and len(a_text) >= 10)
                or similarity_score == 100):
            highly_similar.append(
                [a_id, b_id, a_text, b_text, similarity_score])
You seem to only care about pairs having the same compare_string value. Therefore, and assuming this is not something that all pairs share, we can key by that value to cover far fewer pairs.
To put some numbers on it, let's say you have 120K inputs and 1K items for each distinct compare_string value - then instead of covering 120K * 120K = 14 * 10^9 combinations, you would have 120 bins of size 1K (where in each bin we'd need to check all pairs) = 120 * 1K * 1K = 120 * 10^6, which is about 120 times faster (roughly a factor of the number of bins). And it would be even faster if each bin has fewer than 1K elements.
import collections

# Create a dictionary from compare_string to all items
# with the same compare_string
items_by_compare_string = collections.defaultdict(list)
for item in dictionary.items():
    compare_string = item[1][2]
    items_by_compare_string[compare_string].append(item)

# Iterate over each group of items that have the same
# compare string
for item_group in items_by_compare_string.values():
    # Check pairs only within that group
    for a, b in itertools.combinations(item_group, 2):
        a_id, (a_text, a_set, _) = a
        b_id, (b_text, b_set, _) = b
        # No need to compare the compare_strings!
        if not a_set.isdisjoint(b_set):
            similarity_score = fuzz.ratio(a_text, b_text)
            if ((similarity_score >= 95 and len(a_text) >= 10)
                    or similarity_score == 100):
                highly_similar.append(
                    [a_id, b_id, a_text, b_text, similarity_score])
But, what if we want more speed? Let's look at the remaining operations:
We have a check to find if two sets share at least one item
This seems like an obvious candidate for optimization if we have any knowledge about these sets (to allow us to determine which pairs are even relevant to compare)
Without additional knowledge, just looking at every pair and trying to speed this up, I doubt we can do much - isdisjoint is probably highly optimized using internal details of Python sets, and I don't think we are likely to optimize it further
We have a fuzz.ratio computation, which is an external function that I'm going to assume is heavy
If you are using this from the FuzzyWuzzy package, make sure to install python-Levenshtein to get the speedups detailed here
We have some comparisons which we are unlikely to be able to speed up
We might be able to cache the length of a_text by nesting the two loops, but that's negligible
We have appends to a list, which run in average ("amortized") constant time per operation, so we can't really speed that up
Therefore, I don't think we can reasonably suggest any more speedups without additional knowledge. If we know something about the sets that can help optimize which pairs are relevant we might be able to speed things up further, but I think this is about it.
EDIT: As pointed out in other answers, you can obviously run the code in multi-threading. I assumed you were looking for an algorithmic change that would possibly reduce the number of operations significantly, instead of just splitting these over more CPUs.
Essentially, from the Python programming side, I see two things that can improve your processing time:
Multi-threading and vectorized operations
From the fuzzy score side, here is a list of tips you can use to improve your processing time (open it in a new anonymous tab to avoid the paywall):
https://towardsdatascience.com/fuzzy-matching-at-scale-84f2bfd0c536
Using multiple threads you can speed your operation up by up to N times, where N is the number of threads your CPU supports. You can check that with:
import multiprocessing
multiprocessing.cpu_count()
Using vectorized operations you can process your data in parallel at a low level with SIMD (single instruction, multiple data) operations, or with GPU tensor operations (like those in TensorFlow/PyTorch).
Here is a small comparison of results for each case:
import numpy as np
import time

A = [np.random.rand(512) for i in range(2000)]
B = [np.random.rand(512) for i in range(2000)]
high_similarity = []

def measure(i, j, a, b, high_similarity):
    d = ((a - b)**2).sum()
    if d > 12:
        high_similarity.append((i, j, d))

start_single_thread = time.time()
for i in range(len(A)):
    for j in range(len(B)):
        if i < j:
            measure(i, j, A[i], B[j], high_similarity)
finish_single_thread = time.time()
print("single thread time:", finish_single_thread - start_single_thread)
out[0] single thread time: 147.64517450332642
Running on multiple threads:
from threading import Thread

high_similarity = []

def measure(a=None, b=None, high_similarity=None):
    d = ((a - b)**2).sum()
    if d > 12:
        high_similarity.append(d)

start_multi_thread = time.time()
for i in range(len(A)):
    for j in range(len(B)):
        if i < j:
            thread = Thread(target=measure,
                            kwargs={'a': A[i], 'b': B[j], 'high_similarity': high_similarity})
            thread.start()
            thread.join()
finish_multi_thread = time.time()
print("time to run on multi threads:", finish_multi_thread - start_multi_thread)
out[1] time to run on multi-threads: 11.946279764175415
A_array = np.array(A)
B_array = np.array(B)
start_vectorized = time.time()
for i in range(len(A_array)):
    # vectorized distance operation (one squared distance per row pair)
    dists = ((A_array - B_array)**2).sum(axis=1)
    high_similarity += dists[dists > 12].tolist()
    # rotate B by one row so each A row eventually meets every B row
    # (np.delete/np.insert return new arrays, so the result has to be assigned)
    B_array = np.roll(B_array, 1, axis=0)
finish_vectorized = time.time()
print("time to run vectorized operations:", finish_vectorized - start_vectorized)
out[2] time to run vectorized operations: 2.302949905395508
Note that you can't guarantee any order of execution, so you will also need to store the index of each result. The snippets above are just to illustrate that you can use parallel processing, but I highly recommend using a pool of workers and dividing your dataset into N subsets, one per worker, then joining the final results (instead of creating a thread for each function call like I did); see the sketch below.
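For completeness, here is a minimal sketch of that chunked-pool idea (my own code, not part of the answer above; the arrays are smaller than in the timings so it runs quickly, and the fixed seed means every worker rebuilds identical data even on platforms that spawn rather than fork):
import numpy as np
from multiprocessing import Pool, cpu_count

rng = np.random.RandomState(0)        # fixed seed: workers rebuild identical arrays
A = rng.rand(500, 512)
B = rng.rand(500, 512)

def measure_chunk(i_chunk):
    # each worker handles a slice of the i range and returns (i, j, d) triples
    out = []
    for i in i_chunk:
        d = ((A[i] - B[i + 1:])**2).sum(axis=1)   # vectorized over all j > i
        for k in np.nonzero(d > 12)[0]:
            out.append((i, i + 1 + k, d[k]))
    return out

if __name__ == "__main__":
    chunks = np.array_split(np.arange(len(A)), cpu_count())
    with Pool(cpu_count()) as pool:
        high_similarity = [t for chunk in pool.map(measure_chunk, chunks) for t in chunk]
    print(len(high_similarity))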

I need to make my program's nested loops simpler, since the running time is too long

I am a beginner with nested loops in Python.
Problem:
Below I have written my code. I want to make it simpler, since it takes a long time to produce the result when I run it.
My code:
I have a list which contains 1000 values:
Brake_index_values = [44990678, 44990679, 44990680, 44990681, 44990682, 44990683,
                      44997076, 44990684, 44997077, 44990685,
                      ...
                      44960673, 8195083, 8979525, 100107546, 11089058, 43040161,
                      43059162, 100100533, 10180192, 10036189]
I store element number 1 in another list:
original_top_brake_index = [Brake_index_values[0]]
I created a temporary list called temp and a numpy array for iterating through the loop:
temp =[]
arr = np.arange(0,1000,1)
Loop operation:
for i in range(1, len(Brake_index_values)):
    if top_15_brake <= 15:
        a1 = Brake_index_values[i]
        #a2 = Brake_index_values[j]
        a3 = arr[:i]
        for j in a3:
            a2 = range(Brake_index_values[j] - 30000, Brake_index_values[j] + 30000)
            if a1 in a2:
                pass
            else:
                temp.append(a1)
        if len(temp) == len(a3):
            original_top_brake_index.append(a1)
            top_15_brake += 1
            del temp[:]
        else:
            del temp[:]
            continue
What I did in the code:
I check whether the element Brake_index_values[1] falls within the range of 30000 before and after Brake_index_values[0], that is, range(Brake_index_values[0]-30000, Brake_index_values[0]+30000).
If Brake_index_values[1] falls within that range, I ignore it and go on to the next element, Brake_index_values[2], repeating the same check against Brake_index_values[0] and Brake_index_values[1].
If it falls outside the range, I store the value in original_top_brake_index through an append operation.
The result I get:
It works, but it takes a long time to complete and sometimes it raises a MemoryError.
Requirement:
I just want my code to be simpler and more efficient, using simple operations.
Request:
I am not a good coder, but I am sure there is an easier way to do the above. Kindly shed some light on how to avoid this problem, or suggest a new approach.
You can have a look at numpy.where
(https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.where.html)
to get through this problem. Your code will then look like:
BIV = np.array(Brake_index_values) # shortening for convenience
ref_val = BIV[0]
req_indices, = np.where((BIV < ref_val-3e4) | (BIV > ref_val+3e4))
req_array = BIV[req_indices]
This should give you an array of all the values passing the condition which you can further use.
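If you need to repeat that until 15 values are kept, each at least 30000 away from every previously kept one, a possible greedy version of the same idea is sketched below (my own code, with random stand-in data; swap in your real Brake_index_values list):
import numpy as np

Brake_index_values = np.random.randint(0, 10**8, 1000)   # stand-in for the real list

kept = [Brake_index_values[0]]
candidates = np.asarray(Brake_index_values[1:])
while len(kept) < 15 and candidates.size:
    # drop everything within +/-30000 of the value we just accepted
    candidates = candidates[np.abs(candidates - kept[-1]) >= 30000]
    if candidates.size:
        kept.append(candidates[0])

original_top_brake_index = kept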

How is timeit affected by the length of a list literal?

Update: Apparently I'm only timing the speed with which Python can read a list. This doesn't really change my question, though.
So, I read this post the other day and wanted to compare what the speeds looked like. I'm new to pandas so any time I see an opportunity to do something moderately interesting, I jump on it. Anyway, I initially just tested this out with 100 numbers, thinking that would be sufficient to satisfy my itch to play with pandas. But this is what that graph looked like:
Notice that there are 3 different runs. These runs were run in sequential order, but they all had a spike at the same two spots. The spots were approximately 28 and 64. So my initial thought was it had something to do with bytes, specifically 4. Maybe the first byte contains additional information about it being a list, and then the next byte is all data and every 4 bytes after that causes a spike in speed, which kinda made sense. So I needed to test it with more numbers. So I created a DataFrame of 3 sets of arrays, each with 1000 lists ranging in length from 0-999. I then timed them all in the same manner, that is:
Run 1: 0, 1, 2, 3, ...
Run 2: 0, 1, 2, 3, ...
Run 3: 0, 1, 2, 3, ...
What I expected to see was a dramatic increase approximately every 32 items in the array, but instead there's no recurrence to the pattern (I did zoom in and look for spikes):
However, you'll notice that they all vary a lot between 400 and 682. Oddly, one run always has a spike in the same place, making the pattern harder to distinguish than the 28 and 64 points in this graph. The green line is all over the place really. Shameful.
Question: What's happening at the initial two spikes and why does it get "fuzzy" on the graph between 400 and 682? I just finished running a test over the 0-99 sets but this time did simple addition to each item in the array and the result was exactly linear, so I think it has something to do with strings.
I tested with other methods first and got the same results, but the graph was messed up because I joined the results wrong, so I ran it again overnight (this took a long time) using this code to make sure the times were correctly aligned with their indexes and the runs were performed in the correct order:
import statistics as s
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame([[('run_%s' % str(x + 1)), r, np.random.choice(100, r).tolist()]
                   for r in range(0, 1000) for x in range(3)],
                  columns=['run', 'length', 'array']).sort_values(['run', 'length'])
df['time'] = df.array.apply(lambda x: s.mean(timeit.repeat(str(x))))

# Graph
ax = df.groupby(['run', 'length']).mean().unstack('run').plot(y='time')
ax.set_ylabel('Time [ns]')
ax.set_xlabel('Array Length')
ax.legend(loc=3)
I also have the dataframe pickled if you'd like to see the raw data.
You are severely overcomplicating things using pandas and .apply here. There is no need - it is simply inefficient. Just do it the vanilla Python way:
In [3]: import timeit
In [4]: setup = "l = list(range({}))"
In [5]: test = "str(l)"
Note, timeit functions take a number parameter, which is the number of times everything is run. It defaults to 1000000, so let's make that more reasonable by using number=100, so we don't have to wait around forever...
In [8]: data = [timeit.repeat(test, setup.format(n), number=100) for n in range(0, 10001, 100)]
In [9]: import statistics
In [10]: mean_data = list(map(statistics.mean, data))
Visual inspection of the results:
In [11]: mean_data
Out[11]:
[3.977467228348056e-05,
0.0012597616684312622,
0.002014552320664128,
0.002637979011827459,
0.0034494600258767605,
0.0046060653403401375,
0.006786816345993429,
0.006134035007562488,
0.006666974319765965,
0.0073876206879504025,
0.008359026357841989,
0.008946725012113651,
0.01020014965130637,
0.0110439983351777,
0.012085124345806738,
0.013095536657298604,
0.013812023680657148,
0.014505649354153624,
0.015109792332320163,
0.01541508767210568,
0.018623976677190512,
0.018014412683745224,
0.01837641668195526,
0.01806374565542986,
0.01866597666715582,
0.021138361655175686,
0.020885809014240902,
0.023644315680333722,
0.022424093661053728,
0.024507874331902713,
0.026360396664434422,
0.02618172235088423,
0.02721496132047226,
0.026609957004742075,
0.027632603014353663,
0.029077719994044553,
0.030218352350251127,
0.03213361800105,
0.0321545610204339,
0.032791375007946044,
0.033749551337677985,
0.03418213398739075,
0.03482868466138219,
0.03569800598779693,
0.035460735321976244,
0.03980560234049335,
0.0375820419867523,
0.03880414469555641,
0.03926491799453894,
0.04079093333954612,
0.0420664346893318,
0.044861480011604726,
0.045125720323994756,
0.04562378901755437,
0.04398221097653732,
0.04668888701902082,
0.04841196699999273,
0.047662509993339576,
0.047592316346708685,
0.05009777001881351,
0.04870589632385721,
0.0532167866670837,
0.05079756366709868,
0.05264475334358091,
0.05531930166762322,
0.05283398299555605,
0.055121281009633094,
0.056162080339466534,
0.05814277834724635,
0.05694748067374652,
0.05985202432687705,
0.05949359833418081,
0.05837553597909088,
0.05975819365509475,
0.06247356999665499,
0.061310798317814864,
0.06292542165222888,
0.06698586166991542,
0.06634997764679913,
0.06443380867131054,
0.06923895300133154,
0.06685209332499653,
0.06864909763680771,
0.06959929631557316,
0.06832000267847131,
0.07180017333788176,
0.07092387134131665,
0.07280202202188472,
0.07342300032420705,
0.0745120863430202,
0.07483605532130848,
0.0734497313387692,
0.0763389469939284,
0.07811927401538317,
0.07915793966579561,
0.08072184936221068,
0.08046915601395692,
0.08565403800457716,
0.08061318534115951,
0.08411134833780427,
0.0865995019945937]
This looks pretty darn linear to me. Now, pandas is a handy way to graph things, especially if you want a convenient wrapper around matplotlib's API:
In [14]: import pandas as pd
In [15]: df = pd.DataFrame({'time': mean_data, 'n':list(range(0, 10001, 100))})
In [16]: df.plot(x='n', y='time')
Out[16]: <matplotlib.axes._subplots.AxesSubplot at 0x1102a4a58>
And here is the result:
This should get you on the right track to actually time what you've been trying to time. What you wound up timing, as I explained in the comments:
You are timing the result of str(x) which results in some list-literal,
so you are timing the interpretation of list literals, not the
conversion of list->str
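To make that distinction concrete, here is a small self-contained comparison (my own snippet): the first measurement executes the list literal produced by str(), the second measures the actual list-to-str conversion:
import timeit

lit = str(list(range(1000)))   # "[0, 1, 2, ...]" - a list literal as a string
literal_time = min(timeit.repeat(lit, number=100))                                 # executing the literal
convert_time = min(timeit.repeat("str(l)", "l = list(range(1000))", number=100))   # list -> str conversion
print(literal_time, convert_time)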
I can only speculate as to the patterns you are seeing as the result of that, but that is likely interpreter/hardware dependent. Here are my findings on my machine:
In [18]: data = [timeit.repeat("{}".format(str(list(range(n)))), number=100) for n in range(0, 10001, 100)]
And using a range that isn't so large:
In [23]: data = [timeit.repeat("{}".format(str(list(range(n)))), number=10000) for n in range(0, 101)]
And the results:
Which I guess sort of looks like yours. Perhaps that is better suited for its own question, though.

How do I vectorize the following loop in Numpy?

"""Some simulations to predict the future portfolio value based on past distribution. x is
a numpy array that contains past returns.The interpolated_returns are the returns
generated from the cdf of the past returns to simulate future returns. The portfolio
starts with a value of 100. portfolio_value is filled up progressively as
the program goes through every loop. The value is multiplied by the returns in that
period and a dollar is removed."""
portfolio_final = []
for i in range(10000):
    portfolio_value = [100]
    rand_values = np.random.rand(600)
    interpolated_returns = np.interp(rand_values,cdf_values,x)
    interpolated_returns = np.add(interpolated_returns,1)
    for j in range(1,len(interpolated_returns)+1):
        portfolio_value.append(interpolated_returns[j-1]*portfolio_value[j-1])
        portfolio_value[j] = portfolio_value[j]-1
    portfolio_final.append(portfolio_value[-1])
print (np.mean(portfolio_final))
I couldn't find a way to write this code using numpy. I was having a look at iterations using nditer but I was unable to move ahead with that.
I guess the easiest way to figure out how you can vectorize your stuff would be to look at the equations that govern your evolution and see how your portfolio actually iterates, finding patterns that could be vectorized instead of trying to vectorize the code you already have. You would have noticed that the cumprod actually appears quite often in your iterations.
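To make the cumprod connection explicit: unrolling the recurrence pv_j = pv_{j-1}*r_j - 1 gives pv_n = P_n*(pv_0 - sum_{j=1..n} 1/P_j), where P_j = r_1*...*r_j, which is what the np.cumprod/np.cumsum line in the vector version below computes. A quick numeric check of that identity (my own snippet, with made-up returns):
import numpy as np

rng = np.random.RandomState(0)
rets = 1 + 0.01 * rng.standard_normal(600)   # made-up gross returns r_j

pv = 100.0
for r in rets:                               # plain recurrence: pv_j = pv_{j-1}*r_j - 1
    pv = pv * r - 1

P = np.cumprod(rets)                         # P_j = r_1*...*r_j
pv_closed = P[-1] * (100.0 - np.sum(1.0 / P))
print(pv, pv_closed)                         # agree up to floating-point error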
Nevertheless you can find the semi-vectorized code below. I included your code as well such that you can compare the results. I also included a simple loop version of your code which is much easier to read and translatable into mathematical equations. So if you share this code with somebody else I would definitely use the simple loop option. If you want some fancy-pants vectorizing you can use the vector version. In case you need to keep track of your single steps you can also add an array to the simple loop option and append the pv at every step.
Hope that helps.
Edit: I have not tested anything for speed. That's something you can easily do yourself with timeit.
import numpy as np
from scipy.special import erf

# Prepare simple return model - Normally distributed with mu & sigma = 0.01
x = np.linspace(-10,10,100)
cdf_values = 0.5*(1+erf((x-0.01)/(0.01*np.sqrt(2))))

# Prepare setup such that every code snippet uses the same number of steps
# and the same random numbers
nSteps = 600
nIterations = 1
rnd = np.random.rand(nSteps)

# Your code - Gives the (supposedly) correct results
portfolio_final = []
for i in range(nIterations):
    portfolio_value = [100]
    rand_values = rnd
    interpolated_returns = np.interp(rand_values,cdf_values,x)
    interpolated_returns = np.add(interpolated_returns,1)
    for j in range(1,len(interpolated_returns)+1):
        portfolio_value.append(interpolated_returns[j-1]*portfolio_value[j-1])
        portfolio_value[j] = portfolio_value[j]-1
    portfolio_final.append(portfolio_value[-1])
print (np.mean(portfolio_final))

# Using vectors
portfolio_final = []
for i in range(nIterations):
    portfolio_values = np.ones(nSteps)*100.0
    rcp = np.cumprod(np.interp(rnd,cdf_values,x) + 1)
    portfolio_values = rcp * (portfolio_values - np.cumsum(1.0/rcp))
    portfolio_final.append(portfolio_values[-1])
print (np.mean(portfolio_final))

# Simple loop
portfolio_final = []
for i in range(nIterations):
    pv = 100
    rets = np.interp(rnd,cdf_values,x) + 1
    for i in range(nSteps):
        pv = pv * rets[i] - 1
    portfolio_final.append(pv)
print (np.mean(portfolio_final))
Forget about np.nditer. It does not improve the speed of iterations. Only use it if you intend to go on and use the C version (via Cython).
I'm puzzled about that inner loop. What is it supposed to be doing that's special? Why the loop?
In tests with simulated values these 2 blocks of code produce the same thing:
interpolated_returns = np.add(interpolated_returns,1)
for j in range(1,len(interpolated_returns)+1):
    portfolio_value.append(interpolated_returns[j-1]*portfolio[j-1])
    portfolio_value[j] = portfolio_value[j]-1

interpolated_returns = (interpolated_returns+1)*portfolio - 1
portfolio_value = portfolio_value + interpolated_returns.tolist()
I am assuming that interpolated_returns and portfolio are 1d arrays of the same length.
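A quick numeric check of that claim under the stated assumption (my own snippet, with made-up arrays):
import numpy as np

rng = np.random.RandomState(0)
portfolio = rng.rand(5) * 100            # stand-in 1d array
ir = rng.rand(5) * 0.1                   # stand-in for interpolated_returns

# loop version
portfolio_value = [100]
r = np.add(ir, 1)
for j in range(1, len(r) + 1):
    portfolio_value.append(r[j - 1] * portfolio[j - 1])
    portfolio_value[j] = portfolio_value[j] - 1

# vectorized version
portfolio_value_vec = [100] + ((ir + 1) * portfolio - 1).tolist()

print(np.allclose(portfolio_value, portfolio_value_vec))   # True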

Python nested list performance choice

I am trying to understand whether there is an advantage in space/time/programming to storing data from a signal processing system as a nested list in either:
1) data[channel][sample]
2) data[sample][channel]
I can code the processing for both, though I personally find 1) easier to write and index than 2).
However, 2) is the more common way my local group programs and stores the data (either in Excel/CSV or from the data-gathering systems). While it is easy to transpose
dataA = map(list, zip(*dataB))
I was wondering if there are any storage, performance, or even module compatibility issues with 1) over 2)?
With 1) I can loop like this:
for R in dataA:
    for C in R:
        process_channel(C)

matplotlib.loglog(dataA[0], dataA[i])
where dataA[0] is time or frequency and i is some other channel to plot
With 2):
for R in dataB:
    for C in R:
        process_sample(C)

matplotlib.loglog([j[0] for j in dataB], [k[i] for k in dataB])
This looks worse in programming style. Maybe I am missing a list method that would make this easier? I have also developed code that uses dicts ... but this really breaks down with general use, so I am less inclined to continue using dicts. The dict storage is something like:
dataC = [{'f': 0.1, 'chnl1': 100.0}, {'f': 0.2, 'chnl1': 110.0}]
or some such. It seems that, for better integration, option 2) is better. However, I am trying to understand how best to code with option 2) when you wish to process over channels rather than samples. Do you just transpose the matrix first, do the work in option 1) space, and transpose the results back?
dataA = smoothing(dataA, smooth_factor)

def smoothing(d, step):
    td = map(list, zip(*d))              # transpose to channel-major
    nd = []
    for row in td:
        col = []
        for i in xrange(0, len(row) - step, step):
            col.append(sum(row[i:i+step]) / step)
        nd.append(col)
    nd = numpy.transpose(nd)             # transpose back
    return nd
While this construction works, transposing back and forth all the time looks, um, inefficient.
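One way to avoid the transposing entirely is to keep the data in a 2-D numpy array and smooth along an axis. A minimal sketch (my own code, with made-up data, assuming the sample count is a multiple of the step):
import numpy as np

def smoothing(data, step):
    a = np.asarray(data, dtype=float)        # shape (samples, channels), i.e. option 2) layout
    a = a[: (a.shape[0] // step) * step]     # drop any ragged tail
    # average consecutive blocks of `step` samples per channel - no transposing needed
    return a.reshape(-1, step, a.shape[1]).mean(axis=1)

dataB = np.random.rand(100, 3)               # 100 samples x 3 channels of fake data
smoothed = smoothing(dataB, 4)               # shape (25, 3)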
