Speeding up a kNN function in Julia

I decided to translate some of the Python code from Peter Harrington's Machine Learning in Action into Julia, starting with the kNN algorithm.
After normalizing a dataset he provided, I wrote a few functions: find_kNN(), mass_kNN() (a function that finds kNN for multiple inputs), and a function that splits a given dataset into randomly picked train and test sets, calls mass_kNN(), and plots the resulting accuracy multiple times.
Then I compared the runtimes of the Julia code and the equivalent Python code. (I am using Distances.jl in Julia to find the Euclidean distance and Gadfly to plot, but turning the plotting off doesn't do much to affect the time.)
Results:
Julia:
elapsed time: 1.175523034 seconds (455531636 bytes allocated, 47.54% gc time)
Python:
time elapsed: 0.9517326354980469 sec
I am wondering if there is a way to speed up my Julia code, or if it's already running as fast as possible at this point (i.e. whether I made any glaring mistakes that keep the code from running at full speed).
Julia notebook
Python notebook
repo with both notebooks and the dataset
Thanks!
Edit: Removing the convert() statements and passing everything around as Real slowed the time down to 2.29 sec.

So first of all, I removed the last two lines of plot_probs - plotting isn't really a great thing to benchmark, I think, and it's largely beyond my (or your) control - you could try PyPlot if it's a real factor. I also timed plot_probs a few times to see how much time is spent compiling it the first time:
elapsed time: 1.071184218 seconds (473218720 bytes allocated, 26.36% gc time)
elapsed time: 0.658809962 seconds (452017744 bytes allocated, 40.29% gc time)
elapsed time: 0.660609145 seconds (452017680 bytes allocated, 40.45% gc time)
So there is a 0.3 s penalty paid once. Moving on to the actual algorithm, I used the built-in profiler (e.g. @profile plot_probs(norm_array, 0.25, [1:3], 10, 3)), which revealed that essentially all of the time is spent here:
~80%: [ push!(dist, euclidean(set_array[i,:][:], input_array)) for i in 1:size(set_array, 1) ]
~20%: [ d[i] = get(d, i, 0) + 1 for i in labels[sortperm(dist)][1:k] ]
Using array comprehensions like that is not idiomatic Julia (or Python for that matter). The first is also slow because all that slicing makes many copies of the data. I'm not an expert with Distances.jl, but I think you can replace it with
dist = Distances.colwise(Euclidean(), set_array', input_array)
d = Dict{Int,Int}()
for i in labels[sortperm(dist)][1:k]
    d[i] = get(d, i, 0) + 1
end
which gave me
elapsed time: 0.731732444 seconds (234734112 bytes allocated, 20.90% gc time)
elapsed time: 0.30319397 seconds (214057552 bytes allocated, 37.84% gc time)
More performance could be extracted by doing the transpose once in mass_kNN, but that required touching a few too many places and this post is long enough. Trying to micro-optimize it led me to using
dist = zeros(size(set_array, 1))
@inbounds for i in 1:size(set_array, 1)
    d = 0.0
    for j in 1:length(input_array)
        z = set_array[i,j] - input_array[j]
        d += z*z
    end
    dist[i] = sqrt(d)
end
which gets it to
elapsed time: 0.646256408 seconds (158869776 bytes allocated, 15.21% gc time)
elapsed time: 0.245293449 seconds (138817648 bytes allocated, 35.40% gc time)
so that took about half the time off - but it's not really worth it, and it's less flexible (e.g. what if I wanted the L1 distance). Other code review points (unsolicited, I know):
I find Vector{Float64} and Matrix{Float64} much easier on the eye than Array{Float64,1} and Array{Float64,2} and less likely to be confused.
Float64[] is more usual than Array(Float64, 0)
Int64 can just be written as Int, since it doesn't need to be a 64-bit integer.
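For anyone following along with the Python notebook, the same idea the answer applies in Julia - one vectorized distance computation instead of per-row slicing, then a dict-style vote count - might look roughly like this in NumPy. This is only an illustrative sketch; the function and variable names are assumptions, not the notebook's actual code.
import numpy as np
from collections import Counter

def knn_label(set_array, labels, input_array, k):
    # One vectorized distance computation over all rows at once.
    dist = np.sqrt(((set_array - input_array) ** 2).sum(axis=1))
    # Indices of the k nearest rows, then a vote count over their labels.
    nearest = np.argsort(dist)[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

train = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0]])
labels = ['A', 'A', 'B']
print(knn_label(train, labels, np.array([1.4, 2.4]), k=2))  # -> 'A'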

Related

Python Optimization on Looping Arrays

This is perhaps too generic, but I don't know how to ask it without TMI.
I have a Python program that creates a 6 x 12 numpy matrix with a unique set of 12 strings in the x and y directions. From what I can tell from running my program, it seems to take about 2000 iterations, give or take, to produce a solution. If it fails to create a unique solution, or the solution doesn't meet some qualifications after the initial success, it starts over.
At that point, I was just looping through the program until I had success. I tried using garbage collection, but it still just chewed through memory (I have 26 GB free) and crashed. I then moved to using subprocess.call() to rerun the whole program while killing off the old PID.
This has stabilized the memory consumption, but Windows says the Python.exe process is 500 MB, and I am getting about 340 attempts per minute. I can't tell if this is good or bad. I am 15,000 attempts into my first try. I imagined it would take a ton of attempts, but I'm not sure about the relative speed.
Does this seem slow? Have I messed up the efficiency of the matrices and used too much memory? I don't have any frame of reference for what an optimized calculation rate would be. I have a lot more info if anyone is interested.
Here is the main loop where the program spends the most time:
def uniquecheck(inning, position, checkplayer, checkarr):
    global xcheck
    uniquelist = []
    if xcheck < 2000:
        y = 0
        for row in checkplayer:
            if y <= (inning - 2):
                uniquelist.append(checkplayer[y, position])
            y = y + 1
            xcheck = xcheck + 1
        columns = checkplayer.shape[1]
        z = 0
        for z in range(columns):
            if z != 0:
                if z <= (position - 1):
                    uniquelist.append(checkplayer[(inning - 1), z])
            z = z + 1
            xcheck = xcheck + 1
    return uniquelist
I am getting around the problem by testing and failing candidate solutions faster: instead of filling the full array and then checking it after it's created, I put in some intermediate checks and fail candidates that have no chance of passing early, which saves all of those iterations (see the sketch below).
Not sure how to speed it up any more, but at least it isn't 12-hours slow now. I am sure I could continue to optimize the code, but it doesn't seem worth it now that it is working.
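For reference, the "fail fast" check can be as simple as a set-based duplicate test on the partial row and column, run right after each placement. This is only a sketch with illustrative names, not the original program's variables:
def placement_ok(matrix, inning, position, candidate):
    # Entries already placed above the candidate in its column
    column_above = [matrix[y][position] for y in range(inning - 1)]
    # Entries already placed to the left of the candidate in its row
    row_left = [matrix[inning - 1][z] for z in range(position)]
    # Fail immediately on any collision instead of validating the full 6 x 12 grid later
    return candidate not in set(column_above + row_left)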

Why is the del operator in python so slow, and are there quicker alternatives?

I'm writing some driver code for an audio application. I have a buffer of audio samples, stored as ints, which I'm converting to bytes just before I send them off.
If I play a short wav file, for example, the buffer is filled to a size of about 600000 elements.
What I'm doing is converting the samples, then deleting the converted samples from the buffer, like so:
def start_playback_thread(self):
    # self.alsa_buffer is a Queue type object
    i = 0
    begin_time = time.time()
    while True:
        self.buffer_mutex.acquire()
        if len(self.buffer) - offset == 0:
            self.buffer_mutex.release()
            self.sleep_time(begin_time)
            continue
        # Convert samples to bytes
        size = min(self.period_size, self.buffer_size)
        self.alsa_buffer.put(struct.pack(
            "<{}h".format(size),
            *[self.correct_val(x) for x in self.buffer[:size]]
            )
        )
        if self.buffer_size - offset < self.period_size:
            self.alsa_buffer.put(b"\x00" * (self.buffer_size - self.period_size))
        # Delete used samples from buffer
        del self.buffer[:size]  # < here
        self.buffer_size -= size
        self.buffer_mutex.release()
        i += 1
        if i > 10000:
            break
    logger.debug("Took {:.6f} vs target {:.6f}".format((time.time() - begin_time) / i, self.period_length_time))
The usage of i and begin_time is for timing this only.
Now, I am using a sample rate of 44100 Hz and a period size of 32 frames. This means I have to send one period at least every 0.000726 s.
The above code takes 0.000773 s per period. This is too long. However, when I removed the del operation and replaced it with a simple 'offset' so that nothing is deleted, the time dropped to only 0.000079 s - an order of magnitude quicker.
del was slowing everything down. But why?
And, since I'm not a massive fan of using an 'offset', what alternatives are there that might be quicker?
Have you tried to copy the needed part of the original list?
self.buffer = self.buffer[size:]
Using del can be slower because all the objects will be destructed, using the __del__ function defined in your class. That can be slow if the __del__ function itself is slow (though you may not notice it for a single deletion).
Using list slicing is a much faster method, as it does not replicate the objects themselves (only the references are copied into the new list).
The other technique you could use is a sliding window iterator; I'm not sure if one exists in the Python builtins.
Also, for a fixed-size FIFO buffer you can use deque from collections, or check whether there is anything else in collections that is useful for your specific problem, as sketched below.
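As an illustration of that suggestion (a sketch, not the original driver code), a deque lets you consume size samples from the front without deleting a slice of a very large list:
from collections import deque

buffer = deque(range(600000))   # stand-in for the int sample buffer
period_size = 32

size = min(period_size, len(buffer))
chunk = [buffer.popleft() for _ in range(size)]  # samples for this period; the rest stay queued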

-n and -r arguments to IPython's %timeit magic

I would like to time a code block using the %timeit magic command in a Jupyter notebook. According to the documentation, %timeit takes several arguments. Two in particular control the number of loops and the number of repetitions. What isn't clear to me is the distinction between these two arguments. For example,
import numpy
N = 1000000
v = numpy.arange(N)
%timeit -n 10 -r 500 pass; w = v + v
will run 10 loops and 500 repetitions. My question is: can this be interpreted as the following (with obvious differences in the actual timing results)?
import time
n = 10
r = 500
T = numpy.empty(r)
for j in range(r):
    t0 = time.time()
    for i in range(n):
        w = v + v
    T[j] = (time.time() - t0)/n
print('Best time is {:.4f} ms'.format(min(T)*1000))
An assumption I am making, and it may well be incorrect, is that the time for the inner loop is averaged over the n iterations through this loop. Then the best of the 500 repetitions of this loop is taken.
I have searched the documentation and haven't found anything that specifies exactly what this is doing. For example, the documentation here says:
Options: -n<N>: execute the given statement <N> times in a loop. If this value is not given, a fitting value is chosen.
-r<R>: repeat the loop iteration <R> times and take the best result. Default: 3
Nothing is really said about how the inner loop is timed, and the final result is the "best" of what?
The code I want to time does not involve any randomness, so I am wondering if I should set this inner loop to n=1. Then, the r repetitions will take care of any system variability.
Number and repeat are separate arguments because they serve different purposes. number controls how many executions are done per timing, and it's used to get representative timings. repeat controls how many timings are done, and it's used to get accurate statistics. IPython uses the mean (average) of all repetitions to calculate the run-time of the statement and then divides that number by number, so it measures the average of the averages. In earlier versions it used the minimum time (min()) of all repeats, divided it by number, and reported it as "best of".
To understand why there are two arguments to control the number and the repeats you have to understand what you're timing and how you can measure the time.
The granularity of the clock and the number of executions
A computer has different "clocks" to measure times. These clocks have different "ticks" (depending on the OS). For example, a clock could measure seconds, milliseconds or nanoseconds - these ticks are called the granularity of the clock.
If the duration of the execution is smaller than or roughly equal to the granularity of the clock, one cannot get representative timings. Suppose your operation takes 100 ns (= 0.0000001 seconds) but the clock only measures milliseconds (= 0.001 seconds); then most measurements would read 0 milliseconds and a few would read 1 millisecond - which one depends on where in the clock cycle the execution started and finished. That's not really representative of the duration of what you want to time.
This is on Windows where time.time has a granularity of 1 millisecond:
import time

def fast_function():
    return None

r = []
for _ in range(10000):
    start = time.time()
    fast_function()
    r.append(time.time() - start)

import matplotlib.pyplot as plt
plt.title('measuring time of no-op-function with time.time')
plt.ylabel('number of measurements')
plt.xlabel('measured time [s]')
plt.yscale('log')
plt.hist(r, bins='auto')
plt.tight_layout()
The histogram of the measured times from this example shows that almost all measurements were 0 milliseconds, with three measurements of 1 millisecond.
There are clocks with much finer granularity on Windows; this was just to illustrate the effect of granularity, and every clock has some granularity, even if it is smaller than one millisecond.
To overcome the restriction of the granularity, one can increase the number of executions so that the expected duration is significantly higher than the granularity of the clock. So instead of running the execution once, it's run number times. Taking the numbers from above and using a number of 100,000, the expected run-time would be 100,000 · 100 ns = 0.01 seconds. So, neglecting everything else, the clock would now measure roughly 10 milliseconds in almost all cases, which accurately resembles the expected execution time.
In short, specifying a number measures the sum of number executions. You need to divide the time measured this way by number to get the "time per execution".
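For example (an illustrative snippet, not from the question), timeit.timeit returns the total time for number executions, so dividing by number recovers the per-execution time:
import timeit

number = 100_000
total = timeit.timeit('x = 1 + 1', number=number)  # total time for `number` executions
per_execution = total / number                     # the "time per execution"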
Other processes and the repetitions of the execution
Your OS typically has a lot of active processes; some of them can run in parallel (on different processors or with hyper-threading), but most of them run sequentially, with the OS scheduling time for each process on the CPU. Most clocks don't care which process is currently running, so the measured time will differ depending on the scheduling plan. There are also clocks that measure process time instead of system time. However, they measure the complete time of the Python process, which sometimes includes garbage collection or other Python threads. Besides that, the Python process isn't stateless, not every operation will always be exactly the same, and there are also memory allocations/re-allocations/clears happening (sometimes behind the scenes) whose timing can vary for a lot of reasons.
Again I use a histogram, this time measuring the time it takes to sum ten thousand ones on my computer (using only repeat and setting number to 1):
import timeit
r = timeit.repeat('sum(1 for _ in range(10000))', number=1, repeat=1_000)
import matplotlib.pyplot as plt
plt.title('measuring summation of 10_000 1s')
plt.ylabel('number of measurements')
plt.xlabel('measured time [s]')
plt.yscale('log')
plt.hist(r, bins='auto')
plt.tight_layout()
This histogram shows a sharp cutoff just below ~5 milliseconds, which indicates the "optimal" time in which the operation can be executed. The higher timings are measurements where the conditions were not optimal or where other processes/threads took some of the time.
The typical approach to avoid these fluctuations is to repeat the timing very often and then use statistics to get the most accurate numbers. Which statistic depends on what you want to measure; I'll go into this in more detail below.
Using both number and repeat
Essentially, %timeit is a wrapper over timeit.repeat, which is roughly equivalent to:
import timeit

timer = timeit.default_timer

results = []
for _ in range(repeat):
    start = timer()
    for _ in range(number):
        function_or_statement_to_time
    results.append(timer() - start)
But %timeit has some convenience features compared to timeit.repeat. For example, it calculates the best and the average time of one execution based on the timings it got from repeat and number.
These are calculated roughly like this:
import statistics
best = min(results) / number
average = statistics.mean(results) / number
You could also use TimeitResult (returned if you use the -o option) to inspect all results:
>>> r = %timeit -o ...
7.46 ns ± 0.0788 ns per loop (mean ± std. dev. of 7 runs, 100000000 loops each)
>>> r.loops # the "number" is called "loops" on the result
100000000
>>> r.repeat
7
>>> r.all_runs
[0.7445439999999905,
0.7611092000000212,
0.7249667000000102,
0.7238135999999997,
0.7385598000000186,
0.7338551999999936,
0.7277425999999991]
>>> r.best
7.238135999999997e-09
>>> r.average
7.363701571428618e-09
>>> min(r.all_runs) / r.loops # calculated best by hand
7.238135999999997e-09
>>> from statistics import mean
>>> mean(r.all_runs) / r.loops # calculated average by hand
7.363701571428619e-09
General advice regarding the values of number and repeat
If you want to modify either number or repeat, then you should set number to the minimum value possible without running into the granularity of the timer. In my experience, number should be set so that number executions of the function take at least 10 microseconds (0.00001 seconds); otherwise you might only "time" the minimum resolution of the "timer".
repeat should be set as high as possible. Having more repeats makes it more likely that you really find the true best or average. However, more repeats take longer, so there's a trade-off as well.
IPython adjusts number but keeps repeat constant. I often do the opposite: I adjust number so that number executions of the statement take ~10 µs, and then I adjust repeat so that I get a good representation of the statistics (often it's in the range 100-10000). But your mileage may vary. A sketch of that workflow follows below.
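A sketch of that workflow (the statement and values are illustrative):
import timeit

stmt = 'sum(range(100))'   # illustrative statement
number = 1_000             # chosen so that `number` executions take at least ~10 us
timings = timeit.repeat(stmt, number=number, repeat=1_000)
best_per_execution = min(timings) / number   # minimum of the repeats, per execution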
Which statistic is best?
The documentation of timeit.repeat mentions this:
Note
It’s tempting to calculate mean and standard deviation from the result vector and report these. However, this is not very useful. In a typical case, the lowest value gives a lower bound for how fast your machine can run the given code snippet; higher values in the result vector are typically not caused by variability in Python’s speed, but by other processes interfering with your timing accuracy. So the min() of the result is probably the only number you should be interested in. After that, you should look at the entire vector and apply common sense rather than statistics.
For example, if one wants to find out how fast the algorithm can be, one could use the minimum of these repetitions. If one is more interested in the average or median of the timings, one can use those measurements. In most cases the number one is most interested in is the minimum, because the minimum resembles how fast the execution can be - the minimum run is probably the one where the process was least interrupted (by other processes or by the GC) and had the most optimal memory operations.
To illustrate the differences, I repeated the timing above, but this time I included the minimum, mean, and median:
import timeit
r = timeit.repeat('sum(1 for _ in range(10000))', number=1, repeat=1_000)
import numpy as np
import matplotlib.pyplot as plt
plt.title('measuring summation of 10_000 1s')
plt.ylabel('number of measurements')
plt.xlabel('measured time [s]')
plt.yscale('log')
plt.hist(r, bins='auto', color='black', label='measurements')
plt.tight_layout()
plt.axvline(np.min(r), c='lime', label='min')
plt.axvline(np.mean(r), c='red', label='mean')
plt.axvline(np.median(r), c='blue', label='median')
plt.legend()
Contrary to this "advice" (see the quoted documentation above), IPython's %timeit reports the average instead of the min(). However, it also only uses a repeat of 7 by default - which I think is too few to accurately determine the minimum - so using the average in this case is actually sensible. It's a great tool for "quick-and-dirty" timings.
If you need something that you can customize to your needs, you could use timeit.repeat directly or even a 3rd-party module. For example:
pyperf
perfplot
simple_benchmark (my own library)
It looks like the latest version of %timeit is taking the average of the r n-loop averages, not the best of them.
Evidently, this has changed from earlier versions. The best time of the r averages can still be obtained via the TimeitResult return argument, but it is no longer the value that is displayed.
Comment: I recently ran the code from above and found that the following syntax no longer works:
n = 1
r = 50
tr = %timeit -n $n -r $r -q -o pass; compute_mean(x,np)
That is, it seems it is no longer possible to use $var to pass a variable to the %timeit magic command. Does this mean that this magic command should be retired and replaced with the timeit module?
I am using Python 3.7.4.
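One workaround (a sketch, assuming compute_mean, x, and np are defined in the session) is to call timeit.repeat directly and pass the interactive namespace via globals=, which works on Python 3.5+:
import timeit

n, r = 1, 50
timings = timeit.repeat('compute_mean(x, np)', globals=globals(), number=n, repeat=r)
best = min(timings) / n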

Why inverse for loop in matlab is faster

I recently came across this for-loop code in MATLAB that confused me, because the inverse loop does the same thing faster. Why does this happen?
clear all
a = rand(1000,1000);
b = rand(1000,1000);
for i=1:1000
    for j=1:1000
        c(i,j) = a(i,j) + b(i,j);
    end
end
and the same code with inverse loop:
clear all
a = rand(1000,1000);
b = rand(1000,1000);
for i=1000:-1:1
    for j=1000:-1:1
        c(i,j) = a(i,j) + b(i,j);
    end
end
I did the same in Python with range(1000, 1, -1) and found the same result (the inverse loop is still faster).
Since you did not preallocate your output variable c, when you go in reverse order c is allocated at its full 1000 x 1000 size on the first loop iteration. When you count up, c grows on each iteration, which requires reallocating memory every time and is therefore slower. MATLAB will show this as a warning if you have warnings turned on.
The inverse loop is faster because the first iteration (c(1000,1000) = ...) creates an array of size 1000x1000, while the first piece of code continuously increases the size of the variable c.
To avoid such problems, preallocate the variables you write to in loops. Insert c = zeros(1000,1000) and both versions run fast. Your MATLAB editor shows you warnings (yellow lines) which indicate potential performance problems and other problems with your code. Read those messages!
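For what it's worth, the same preallocation effect can be reproduced in Python with NumPy (a rough, illustrative analogue; this is not a model of MATLAB's internals):
import numpy as np
import time

a = np.random.rand(1000, 1000)
b = np.random.rand(1000, 1000)

start = time.time()
c_grow = np.empty((0, 1000))
for i in range(1000):
    c_grow = np.vstack([c_grow, a[i] + b[i]])   # reallocates and copies every iteration
print('growing:      {:.3f} s'.format(time.time() - start))

start = time.time()
c_pre = np.zeros((1000, 1000))                  # preallocated once
for i in range(1000):
    c_pre[i] = a[i] + b[i]                      # writes in place
print('preallocated: {:.3f} s'.format(time.time() - start))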

how to implement a really efficient bitvector sorting in python

Actually, this is an interesting topic from Programming Pearls: sorting 10-digit telephone numbers in limited memory with an efficient algorithm. You can find the whole story here.
What I am interested in is just how fast the implementation can be in Python. I have done a naive implementation with the BitVector module. The code is as follows:
from BitVector import BitVector
import timeit
import random
import time
import sys
def sort(input_li):
    return sorted(input_li)

def vec_sort(input_li):
    bv = BitVector( size = len(input_li) )
    for i in input_li:
        bv[i] = 1
    res_li = []
    for i in range(len(bv)):
        if bv[i]:
            res_li.append(i)
    return res_li
if __name__ == "__main__":
    test_data = range(int(sys.argv[1]))
    print 'test_data size is:', sys.argv[1]
    random.shuffle(test_data)

    start = time.time()
    sort(test_data)
    elapsed = (time.time() - start)
    print "sort function takes " + str(elapsed)

    start = time.time()
    vec_sort(test_data)
    elapsed = (time.time() - start)
    print "vec_sort function takes " + str(elapsed)
I have tested array sizes from 100 to 10,000,000 on my MacBook (2 GHz Intel Core 2 Duo, 2 GB SDRAM); the results are as follows:
test_data size is: 1000
sort function takes 0.000274896621704
vec_sort function takes 0.00383687019348
test_data size is: 10000
sort function takes 0.00380706787109
vec_sort function takes 0.0371489524841
test_data size is: 100000
sort function takes 0.0520560741425
vec_sort function takes 0.374383926392
test_data size is: 1000000
sort function takes 0.867373943329
vec_sort function takes 3.80475401878
test_data size is: 10000000
sort function takes 12.9204008579
vec_sort function takes 38.8053860664
What disappoints me is that even when the test_data size is 10,000,000, the sort function is still faster than vec_sort. Is there any way to accelerate the vec_sort function?
As Niki pointed out, you are comparing a very fast C routine with a Python one. Using psyco speeds it up a little bit for me, but you can really speed it up by using a bit-vector module written in C. I used bitarray, and then the bit-sorting method surpasses the built-in sort for array sizes of about 250,000 and above (using psyco).
Here's the function that I used:
def vec_sort2(input_li):
    bv = bitarray(len(input_li))
    bv.setall(0)
    for i in input_li:
        bv[i] = 1
    return [i for i in xrange(len(bv)) if bv[i]]
Notice also that I have used a list comprehension to construct the sorted list, which helps a bit. Using psyco and the above function along with your functions, I get the following results:
test_data size is: 1000000
sort function takes 1.29699993134
vec_sort function takes 3.5150001049
vec_sort2 function takes 0.953999996185
As a side note, BitVector isn't especially optimized, even for Python. Before I found bitarray, I made various tweaks to the module, and with my tweaked version the time for vec_sort is reduced by over a second for this array size. I haven't submitted my changes, though, because bitarray is just so much faster.
My Python isn't the best but it looks like you have a bug in your code:
bv = BitVector( size = len(input_li) )
The size of your bit vector is the same as the size of your input array. You want the bit vector to be the size of your domain: 10^10. I'm not sure how Python's bit vectors deal with overflows, but if the bit vector automatically resizes, you are getting quadratic behavior.
Additionally, I imagine that Python's sort function is implemented in C and does not have the overhead of a sort implemented purely in Python. However, that probably wouldn't cause an O(n log n) algorithm to run substantially faster than an O(n) algorithm.
Edit: also, this sort will only pay off on large data sets. Your algorithm runs in O(n + 10^10) time (based on your tests, I assume you know this), which will be worse than O(n log n) for small inputs.
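A sketch of the fix this answer suggests, using the bitarray module from the earlier answer (the domain size is reduced here for illustration; the real 10-digit problem would need 10^10 bits, roughly 1.25 GB):
from bitarray import bitarray

def vec_sort_domain(input_li, domain_size=10**7):
    # Size the bit vector to the domain, not to the length of the input,
    # so every possible key has its own bit.
    bv = bitarray(domain_size)
    bv.setall(0)
    for i in input_li:
        bv[i] = 1
    # Scanning every bit of the domain makes the total cost O(n + domain_size).
    return [i for i in range(domain_size) if bv[i]]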
