I would like to time a code block using the timeit magic command in a Jupyter notebook. According to the documentation, timeit takes several arguments. Two in particular control number of loops and number of repetitions. What isn't clear to me is the distinction between these two arguments. For example
import numpy
N = 1000000
v = numpy.arange(N)
%timeit -n 10 -r 500 pass; w = v + v
will run 10 loops and 500 repetitions. My question is,
Can this be interpreted as the following? (with obvious differences in the actual timing results)
import time
n = 10
r = 500
T = numpy.empty(r)
for j in range(r):
    t0 = time.time()
    for i in range(n):
        w = v + v
    T[j] = (time.time() - t0)/n
print('Best time is {:.4f} ms'.format(min(T)*1000))  # "best" = fastest, i.e. the minimum
An assumption I am making, and may well be incorrect, is that the time for the inner loop is averaged over the n iterations through this loop. Then the best of 500 repetitions of this loop is taken.
I have searched the documentation, and haven't found anything that specifies exactly what this is doing. For example, the documentation here is
Options: -n<N>: execute the given statement <N> times in a loop. If this value is not given, a fitting value is chosen.
-r<R>: repeat the loop iteration <R> times and take the best result. Default: 3
Nothing is really said about how the inner loop is timed. The final result is the "best" of what?
The code I want to time does not involve any randomness, so I am wondering if I should set this inner loop to n=1. Then, the r repetitions will take care of any system variability.
number and repeat are separate arguments because they serve different purposes. number controls how many executions are done per timing, and it is used to get representative timings. repeat controls how many timings are taken, and it is used to get accurate statistics. IPython takes the mean of all repetitions and then divides it by number to report the run-time of the statement, so it measures the average of the averages. Earlier versions took the minimum (min()) of all repeats, divided it by number, and reported it as the "best of".
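As a small illustration of that arithmetic (the raw timings below are made up; only the relationship between -n, -r, and the reported numbers matters):
import statistics

n = 10                                   # -n: loops per timing
raw_timings = [0.012, 0.011, 0.013]      # one total per repetition (-r 3), each covering n loops
per_loop = [t / n for t in raw_timings]

print(statistics.mean(per_loop))         # what recent IPython reports (mean of the averages)
print(min(per_loop))                     # what older IPython reported as "best of 3"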
To understand why there are two arguments to control the number and the repeats you have to understand what you're timing and how you can measure the time.
The granularity of the clock and the number of executions
A computer has different "clocks" to measure times. These clocks have different "ticks" (depending on the OS). For example it could measure seconds, milliseconds or nanoseconds - these ticks are called the granularity of the clock.
If the duration of the execution is smaller than or roughly equal to the granularity of the clock, one cannot get representative timings. Suppose your operation takes 100 ns (= 0.0000001 seconds) but the clock only measures milliseconds (= 0.001 seconds): most measurements would then read 0 milliseconds and a few would read 1 millisecond - which one depends on where in the clock cycle the execution started and finished. That's not really representative of the duration of what you want to time.
This is on Windows where time.time has a granularity of 1 millisecond:
import time
def fast_function():
    return None

r = []
for _ in range(10000):
    start = time.time()
    fast_function()
    r.append(time.time() - start)
import matplotlib.pyplot as plt
plt.title('measuring time of no-op-function with time.time')
plt.ylabel('number of measurements')
plt.xlabel('measured time [s]')
plt.yscale('log')
plt.hist(r, bins='auto')
plt.tight_layout()
This shows the histogram of the measured times from this example: almost all measurements were 0 milliseconds, and only three measurements were 1 millisecond.
There are clocks with a much lower granularity on Windows; this was just to illustrate the effect of granularity, and every clock has some granularity, even if it is lower than one millisecond.
To overcome the restriction of the granularity one can increase the number of executions so that the expected duration is significantly higher than the granularity of the clock. So instead of running the execution once, it's run number times. Taking the numbers from above and using a number of 100 000, the expected run-time would be 100,000 × 100 ns = 0.01 seconds. So, neglecting everything else, the clock would now measure 10 milliseconds in almost all cases, which would accurately resemble the expected execution time.
In short, specifying a number measures the sum of number executions. You need to divide the times measured this way by number again to get the "time per execution".
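As a rough illustration (a minimal sketch, not how %timeit is implemented), timing the same trivial statement with number=1 versus a large number shows how dividing by number recovers a per-execution time:
import timeit

stmt = 'x = 1 + 1'
t_once = timeit.timeit(stmt, number=1)            # a single execution: dominated by timer granularity/overhead
t_many = timeit.timeit(stmt, number=100_000)      # the sum of 100_000 executions
print(t_once, t_many / 100_000)                   # divide by number to get the time per execution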
Other processes and the repetitions of the execution
Your OS typically has a lot of active processes; some of them can run in parallel (on different processors or using hyper-threading), but most of them run sequentially, with the OS scheduling time for each process on the CPU. Most clocks don't care which process is currently running, so the measured time will differ depending on the scheduling plan. There are also some clocks that measure process time instead of system time. However, they measure the complete time of the Python process, which sometimes includes a garbage collection or other Python threads - besides that, the Python process isn't stateless, not every operation will always be exactly the same, and there are also memory allocations/re-allocations/clears happening (sometimes behind the scenes) whose times can vary for many reasons.
Again I use a histogram measuring the time it takes to sum ten thousand ones on my computer (only using repeat and setting number to 1):
import timeit
r = timeit.repeat('sum(1 for _ in range(10000))', number=1, repeat=1_000)
import matplotlib.pyplot as plt
plt.title('measuring summation of 10_000 1s')
plt.ylabel('number of measurements')
plt.xlabel('measured time [s]')
plt.yscale('log')
plt.hist(r, bins='auto')
plt.tight_layout()
This histogram shows a sharp cutoff at just below ~5 milliseconds, which indicates that this is the "optimal" time in which the operation can be executed. The higher timings are measurements where the conditions were not optimal or where other processes/threads took some of the time.
The typical approach to avoid these fluctuations is to repeat the timing very often and then use statistics to get the most accurate numbers. Which statistic depends on what you want to measure. I'll go into this in more detail below.
Using both number and repeat
Essentially, %timeit is a wrapper over timeit.repeat, which is roughly equivalent to:
import timeit
timer = timeit.default_timer
results = []
for _ in range(repeat):
    start = timer()
    for _ in range(number):
        function_or_statement_to_time
    results.append(timer() - start)
But %timeit has some convenience features compared to timeit.repeat. For example it calculates the best and average times of one execution based on the timings it got by repeat and number.
These are calculated roughly like this:
import statistics
best = min(results) / number
average = statistics.mean(results) / number
You could also use TimeitResult (returned if you use the -o option) to inspect all results:
>>> r = %timeit -o ...
7.46 ns ± 0.0788 ns per loop (mean ± std. dev. of 7 runs, 100000000 loops each)
>>> r.loops # the "number" is called "loops" on the result
100000000
>>> r.repeat
7
>>> r.all_runs
[0.7445439999999905,
0.7611092000000212,
0.7249667000000102,
0.7238135999999997,
0.7385598000000186,
0.7338551999999936,
0.7277425999999991]
>>> r.best
7.238135999999997e-09
>>> r.average
7.363701571428618e-09
>>> min(r.all_runs) / r.loops # calculated best by hand
7.238135999999997e-09
>>> from statistics import mean
>>> mean(r.all_runs) / r.loops # calculated average by hand
7.363701571428619e-09
General advice regarding the values of number and repeat
If you want to modify either number or repeat then you should set number to the minimum value possible without running into the granularity of the timer. In my experience number should be set so that number executions of the function take at least 10 microseconds (0.00001 seconds) otherwise you might only "time" the minimum resolution of the "timer".
The repeat should be set as high as possible. Having more repeats will make it more likely that you really find the real best or average. However more repeats will take longer so there's a trade-off as well.
IPython adjusts number but keeps repeat constant. I often do the opposite: I adjust number so that the number executions of the statement take ~10 µs, and then I adjust repeat so that I get a good representation of the statistics (often it's in the range 100-10000). But your mileage may vary.
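A minimal sketch of that manual approach (the statement and the thresholds are just illustrative):
import timeit

stmt = 'sum(range(100))'                            # illustrative statement
number = 1
while timeit.timeit(stmt, number=number) < 1e-5:    # grow number until one timing takes at least ~10 us
    number *= 10
results = timeit.repeat(stmt, number=number, repeat=1000)
best = min(results) / number                        # best time per execution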
Which statistic is best?
The documentation of timeit.repeat mentions this:
Note
It’s tempting to calculate mean and standard deviation from the result vector and report these. However, this is not very useful. In a typical case, the lowest value gives a lower bound for how fast your machine can run the given code snippet; higher values in the result vector are typically not caused by variability in Python’s speed, but by other processes interfering with your timing accuracy. So the min() of the result is probably the only number you should be interested in. After that, you should look at the entire vector and apply common sense rather than statistics.
For example one typically wants to find out how fast the algorithm can be, then one could use the minimum of these repetitions. If one is more interested in the average or median of the timings one can use those measurements. In most cases the number one is most interested in is the minimum, because the minimum resembles how fast the execution can be - the minimum is probably the one execution where the process was least interrupted (by other processes, by GC, or had the most optimal memory operations).
To illustrate the differences I repeated the above timing again but this time I included the minimum, mean, and median:
import timeit
r = timeit.repeat('sum(1 for _ in range(10000))', number=1, repeat=1_000)
import numpy as np
import matplotlib.pyplot as plt
plt.title('measuring summation of 10_000 1s')
plt.ylabel('number of measurements')
plt.xlabel('measured time [s]')
plt.yscale('log')
plt.hist(r, bins='auto', color='black', label='measurements')
plt.tight_layout()
plt.axvline(np.min(r), c='lime', label='min')
plt.axvline(np.mean(r), c='red', label='mean')
plt.axvline(np.median(r), c='blue', label='median')
plt.legend()
Contrary to this "advice" (see the quoted documentation above), IPython's %timeit reports the average instead of the min(). However, it also only uses a repeat of 7 by default - which I think is too few to accurately determine the minimum - so using the average in this case is actually sensible. It's a great tool to do a "quick-and-dirty" timing.
If you need something that you can customize based on your needs, you could use timeit.repeat directly or even a 3rd party module. For example:
pyperf
perfplot
simple_benchmark (my own library)
It looks like the latest version of %timeit is taking the average of the r n-loop averages, not the best of the averages.
Evidently, this has changed from earlier versions of IPython. The best time of the r averages can still be obtained via the TimeitResult return value, but it is no longer the value that is displayed.
Comment: I recently ran the code from above and found that the following syntax no longer works:
n = 1
r = 50
tr = %timeit -n $n -r $r -q -o pass; compute_mean(x,np)
That is, it seems it is no longer possible to use $var to pass a variable to the timeit magic command. Does this mean that this magic command should be retired and replaced with the timeit module?
I am using Python 3.7.4.
Related
I am coding a program that takes 54 (num1) numbers and puts them in a list. It then takes 16 (num2) of those numbers and forms a list that contains lists of 16 numbers chosen from all the possible "num1 choose num2" combinations. It then takes those lists and generates 4x4 arrays.
The code I have works, but running 54 numbers to get all the arrays I want will take a long time. I know this because I have tested the code using from 20 up to 40 numbers and timed it.
20 numbers = 0.000055 minutes
30 numbers = 0.045088 minutes
40 numbers = 17.46944 minutes
Using all the 20 points of test data I got, I built a math model to predict how long it would take to run the 54 numbers, and I am getting 1740 minutes = 29 hours. This is already an improvement from a v1 of this code that was predicting 38 hours and from a v0 that was actually crashing my machine.
I am reaching out to you to try and make this run even faster. The program is not even RAM intensive. I have 8GB of RAM and a core i7 processor, it does not slow down my machine at all. It actually runs very smooth compared to previous versions I had where my computer even crashed a few times.
Do you guys think there is a way? I am currently sampling to reduce processing time, but I would prefer not to sample at all if that's possible. To reduce processing time I am not even printing the arrays; I am just printing a counter to see how many combinations I generated.
This is the code:
import numpy as np
import itertools
from itertools import combinations
from itertools import islice
from random import sample
num1 = 30 #ideally this is 54, just using 30 now so you can test it.
num2 = 16
steps = 1454226 #represents 1% of "num1"c"num2" just to reduce processing time for testing.
nums = list()
for i in range(1, num1+1):
    nums.append(i)
#print ("nums: ", nums) #Just to ensure that I am indeed using numbers from 1 to num1
vun = list()
tabl = list()
counter = 0
combin = islice(itertools.combinations(nums, num2), 0, None, steps)
for i in set(combin):
    vun.append(sample(i, num2))
    counter = counter + 1
    p1=i[0];p2=i[1];p3=i[2];p4=i[3];p5=i[4];p6=i[5];p7=i[6];p8=i[7];p9=i[8]
    p10=i[9];p11=i[10];p12=i[11];p13=i[12];p14=i[13];p15=i[14];p16=i[15]
    tun = np.array([(p1,p2,p3,p4),(p5,p6,p7,p8),(p9,p10,p11,p12),(p13,p14,p15,p16)])
    tabl.append(tun)
# print ("TABL:" ,tabl)
# print ("vun: ", vun)
print ("combinations:",counter)
The output I get with this code is:
combinations: 101
Ideally, this number would be 2.109492366×10¹³, or at least 1% of that. As long as it runs the 54x16 case and does not take 29 hours.
The main inefficiency comes from generating all the combinations (itertools.combinations(nums, num2)), only to throw most of them away.
Another approach would be to generate combinations at random, ensuring there are no duplicates.
import itertools
import random
def random_combination(iterable, r):
    "Random selection from itertools.combinations(iterable, r)"
    pool = tuple(iterable)
    n = len(pool)
    indices = sorted(random.sample(range(n), r))
    return tuple(pool[i] for i in indices)

items = list(range(1, 55))

samples = set()
while len(samples) < 100_000:
    sample = random_combination(items, 16)
    samples.add(sample)

for sample in samples:
    board = list(sample)
    random.shuffle(board)
    board = [board[0:4], board[4:8], board[8:12], board[12:16]]
print("done")
This uses the random_combination function from answers to this question, which in turn comes from the itertools documentation.
The code generates 100,000 unique 4x4 samples in about ten seconds, at least on my machine.
A few notes:
Each sample is a tuple and the entries are sorted; this means we can store them in a set and avoid duplicates.
Because of the first point, we shuffle each sample before creating a 4x4 board from it; the final loop doesn't do anything with these boards, but I wanted to include them to get a sense of the timing.
It's possible that there would be lots of hash collisions if you were to sample a large proportion of the space, but that's not feasible anyway because of the amount of data that would involve (see below).
I think there's been some confusion about what you are trying to achieve here.
54C16 = 2.1 x 10^13 ... to store 16 8-bit integers for all of these points would take 2.7 x 10^15 bits, which is 337.5 terabytes. That's beyond what could be stored on a local disk.
So, to cover even 1% of the space would take over 3TB ... maybe possible to store on disk at a push. You hinted in the question that you'd like to cover this proportion of the space. Clearly that's not going to happen in 8GB of RAM.
Just calculating the number of combinations is trivial, since it's just a formula:
import math
math.comb(30, 16)
# 145422675
math.comb(54, 16)
# 21094923659355
The trouble is that storing the results of the 16-of-30 case requires about 64 GB of RAM on my machine. You might, but probably don't, have that much RAM just sitting around like I do. The 16-of-54 case requires about 9.3 PB of RAM, which no modern architecture supports.
You're going to need to take one of two approaches:
Limit to the 16 in 30 case, and don't store any results into vun or tabl.
Pros: Can be made to work in < 5 minutes in my testing.
Cons: Doesn't work for the 16 in 54 case at all, no additional processing is practical
Do a Monte Carlo simulation instead: generate randomized combinations up to some large but reachable sample count and do your math on those.
Pros: Fast and supports both 16 of 30, and 16 of 54 potentially with the same time performance
Cons: Results will have some random variation depending on random seed, should be statistically treated to get confidence intervals for validity.
Note: The formulas used for confidence intervals depend on which actual math you intend to do with these numbers, the average is a good place to start though if you're only looking for an estimate.
I strongly suggest option (2), the Monte Carlo simulation.
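A rough sketch of what option (2) could look like, reusing the random-combination idea from the earlier answer (the statistic computed here, the mean of each 16-number sample, is purely illustrative):
import math
import random
import statistics

def random_combination(pool, r):
    """One uniformly random r-combination; indices are sorted so equal samples compare equal."""
    idx = sorted(random.sample(range(len(pool)), r))
    return tuple(pool[i] for i in idx)

items = list(range(1, 55))
n_samples = 100_000
# compute the illustrative statistic for each random 16-number sample
per_sample = [statistics.mean(random_combination(items, 16)) for _ in range(n_samples)]

estimate = statistics.mean(per_sample)
# rough 95% confidence interval for the estimated mean
half_width = 1.96 * statistics.stdev(per_sample) / math.sqrt(n_samples)
print(f"estimate: {estimate:.4f} +/- {half_width:.4f}")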
I've looked over several answers similar to this question, and all seem to have good one-liner answers that, however, only deal with making the list unique by removing duplicates. I need the list to contain exactly 5 unique tuples.
The only code I could come up with is as such:
from random import *
tuples = []
while len(tuples) < 5:
    rand = (randint(0, 6), randint(0, 6))
    if rand not in tuples:
        tuples.append(rand)
I feel like there is a simpler way but I can't figure it out. I tried playing with sample() from random:
sample((randint(0,6), randint(0,6)), 5)
But this gives me a "Sample larger than population or is negative" error.
One quick way is to use itertools.product to generate all tuple possibilities before using sample to choose 5 from them:
from itertools import product
from random import sample
sample(list(product(range(7), repeat=2)), k=5)
For such a small set of inputs, just generate all possible outputs, and sample them:
import itertools
import random
size = 6
random.sample(list(itertools.product(range(size+1), repeat=2)), 5)
You indicate that the bounds (size) may be a parameter though, and if the bounds might be even a little larger, this could be a problem (you'd be generating size ** 2 tuples to select 5 of them, and the memory usage could get out of control). If that's a problem, given you only need a pair of integers, there is a cheap trick: Choose one random integer that encodes both resulting integers, then decode it. For example:
size = 6
raw_sample = random.sample(range((size + 1) ** 2), 5)
decoded_sample = [divmod(x, size+1) for x in raw_sample]
Since range is zero overhead (the memory usage doesn't depend on the length), you can select precisely five values from it with overhead proportionate to the five selected, not the 49 possible results. You then compute the quotient and remainder based on the range of a single value (0 to size inclusive in this case, so size + 1 possible values), and that gets the high and low results cheaply.
The performance differences are stark; comparing:
def unique_random_pairs_by_product(size):
return random.sample(list(itertools.product(range(size+1), repeat=2)), 5)
to:
def unique_random_pairs_optimized(size):
val_range = size + 1
return [divmod(x, val_range) for x in random.sample(range(val_range * val_range), 5)]
the optimized version takes about 15% less time even for an argument of 6 (~4.65 μs for product, ~3.95 μs for optimized). But at size of 6, you're not seeing the scaling factor at all. For size=100, optimized only increases to ~4.35 μs (the time increasing slightly because the larger range is more likely to have to allocate new ints, instead of using the small int cache), while product jumps to 387 μs, a nearly 100x difference. And for size=1000, the time for product jumps to 63.8 ms, while optimized remains ~4.35 μs; a factor of 10,000x difference in runtime (and an even higher multiplier on memory usage). If size gets any larger than that, the product-based solution will quickly reach the point where the delay from even a single sampling is noticeable to humans; the optimized solution will continue to run with identical performance (modulo incredibly tiny differences in the cost of the divmod).
I'm following along with the videos for Interactive Python's course on data structures and algorithms. In one segment the following piece of code appears. It's meant to demonstrate an example of O(n**2) complexity.
It's supposed to loop through the range starting from 1000, and ending at 10000. But I have no idea why 100000 is given to the randrange function in the list comprehension on line 2.
Thanks in advance!
Note: i'm following along with this course - http://interactivepython.org/runestone/static/pythonds/AlgorithmAnalysis/BigONotation.html
for listSize in range(1000, 10001, 1000):
    alist = [randrange(100000) for x in range(listSize)]
    start = time.time()
    print(findmin(alist))
    end = time.time()
    print("size: %d time: %f" % (listSize, end-start))
This is a time trial, testing how fast findmin() is. That's best done with randomised data, to avoid pathological cases. The list comprehension produces the test data. The 100000 is just an upper bound for the random values in that list, high enough to ensure that even for a list with 10k integers there is a nice spread of values.
Note that it is better to use the timeit module to execute time trials.
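For example, a minimal sketch of how the same trial could be written with timeit (findmin and the 100000 bound come from the snippet above; the repeat count is arbitrary):
import timeit
from random import randrange

for listSize in range(1000, 10001, 1000):
    alist = [randrange(100000) for _ in range(listSize)]
    # time only the findmin call, repeated a few times, and report the fastest run
    best = min(timeit.repeat(lambda: findmin(alist), number=1, repeat=5))
    print("size: %d time: %f" % (listSize, best))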
Background
My software visualizes very large datasets; the data is so large that I can't store it all in RAM at any one time, so it has to be loaded in a paged fashion. I embed matplotlib functionality for displaying and manipulating the plot in the backend of my application.
These datasets contain three internal lists I use for visualization: time, height and dataset. My program plots the data as time vs. height, and additionally users have the option of drawing shapes around regions of the graph that can be extracted to a whole different plot.
The difficult part is that when I want to extract the data from the shapes, the shape vertices are real coordinates computed by the plot, not rounded to the nearest point in my time array. Here's an example of a shape which bounds a region in my program.
While X1 may represent the coordinate (2007-06-12 03:42:20.070901+00:00, 5.2345) according to matplotlib, the closest coordinate existing in time and height might be something like (2007-06-12 03:42:20.070801+00:00, 5.219) , only a small bit off from matploblib's coordinate.
The Problem
So given some arbitrary value, lets say x1 = 732839.154395 (a representation of the date in number format) and a list of similar values with a constant step:
732839.154392
732839.154392
732839.154393
732839.154393
732839.154394
732839.154394
732839.154395
732839.154396
732839.154396
732839.154397
732839.154397
732839.154398
732839.154398
732839.154399
etc...
What would be the most efficient way of finding the closest representation of that point? I could simply loop through the list and grab the value with the smallest difference, but the size of time is huge. Since I know the array is (1) sorted and (2) increments with a constant step, I was thinking this problem should be solvable in O(1) time. Is there a known algorithm that solves these kinds of problems? Or would I simply need to devise a custom algorithm? Here is my current thought process (a short sketch follows the steps below):
1. grab the first and second elements of time
2. subtract the first element from the second to obtain the step
3. subtract the first element of time from the bounding x value to obtain the difference
4. divide the difference by the step to obtain the index
5. move to that index in time
6. check the surrounding elements of the index to ensure the closest representation
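A minimal sketch of those steps, assuming times is sorted with a constant step (the function name is just illustrative):
import numpy as np

def closest_index(times, x):
    """Estimate the index of the value in `times` closest to x in O(1), then verify the neighbours."""
    step = times[1] - times[0]                       # step from the first two elements
    i = int(round((x - times[0]) / step))            # jump straight to the estimated index
    i = max(0, min(i, len(times) - 1))               # clamp to a valid index
    lo, hi = max(0, i - 1), min(len(times), i + 2)   # check the surrounding elements
    window = np.asarray(times[lo:hi])
    return lo + int(np.argmin(np.abs(window - x)))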
The algorithm you suggest seems reasonable and should work.
As has become clear in your comments, the problem with it is the coarseness at which your time was recorded. (This can be common when unsynchronized data is recorded -- ie, the data generation clock, eg, frame rate, is not synced with the computer).
The easy way around this is to read two points separated by a larger time; for example, read the first time value and then the 1000th time value. Everything then stays the same in your calculation, but you get your timestep by subtracting the two and then dividing by 1000.
Here's a test that makes data similar to yours:
import matplotlib.pyplot as plt
start = 97523.29783
increment = .000378912098
target = 97585.23452
# build a timeline
times = []
time = start
actual_index = None
for i in range(1000000):
    trunc = float(str(time)[:10]) # truncate the time value
    times.append(trunc)
    if actual_index is None and time > target:
        actual_index = i
    time = time + increment

# now test
intervals = [1, 2, 5, 10, 100, 1000, 10000]
for i in intervals:
    dt = (times[i] - times[0])/i
    index = int((target-start)/dt)
    print(" %6i %8i %8i %.10f" % (i, actual_index, index, dt))
Result:
span actual guess est dt (actual=.000378912098)
1 163460 154841 0.0004000000
2 163460 176961 0.0003500000
5 163460 162991 0.0003800000
10 163460 162991 0.0003800000
100 163460 163421 0.0003790000
1000 163460 163464 0.0003789000
10000 163460 163460 0.0003789100
That is, as the space between the sampled points gets larger, the time interval estimate gets more accurate (compare to increment in the program) and the estimated index (3rd column) gets closer to the actual index (2nd column). Note that the accuracy of the dt estimate is basically proportional to the number of digits in the span. The best you could do is use the times at the start and end points, but it seemed from your question statement that this would be difficult; if it's not, it will give the most accurate estimate of your time interval. Note that here, for clarity, I exaggerated the lack of accuracy by making my time interval recording very coarse, but in general every power of 10 in your span increases your accuracy by the same amount.
As an example of that last point, if I reduce the coarseness of the time values by changing the truncating line to trunc = float(str(time)[:12]), I get:
span actual guess est dt (actual=.000378912098)
1 163460 163853 0.0003780000
10 163460 163464 0.0003789000
100 163460 163460 0.0003789100
1000 163460 163459 0.0003789120
10000 163460 163459 0.0003789121
So if, as you say, using a span of 1 gets you very close, using a span of 100 or 1000 should be more than enough.
Overall, this is very similar in idea to the linear "interpolation search". It's just a bit easier to implement because it's only making a single guess based on the interpolation, so it just takes one line of code: int((target-start)*i/(times[i] - times[0]))
What you're describing is pretty much interpolation search. It works very much like binary search, but instead of choosing the middle element it assumes the distribution is close to uniform and guesses the approximate location.
The wikipedia link contains a C++ implementation.
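For reference, here is a minimal Python sketch of interpolation search adapted to return the index of the closest value (not the Wikipedia C++ version, just an illustration of the idea):
def interpolation_search_closest(a, x):
    """Index of the value in sorted list `a` closest to x, guessing positions by interpolation."""
    lo, hi = 0, len(a) - 1
    while lo <= hi and a[lo] <= x <= a[hi]:
        if a[hi] == a[lo]:                     # flat span: x equals this value, any index in it works
            return lo
        # guess the position assuming roughly uniform spacing
        mid = lo + int((x - a[lo]) * (hi - lo) / (a[hi] - a[lo]))
        if a[mid] < x:
            lo = mid + 1
        elif a[mid] > x:
            hi = mid - 1
        else:
            return mid
    # x fell between (or outside) the remaining candidates: pick the nearest one
    candidates = [i for i in {lo - 1, lo, hi, hi + 1} if 0 <= i < len(a)]
    return min(candidates, key=lambda i: abs(a[i] - x))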
What you did is actually finding the index of the n-th element of an arithmetic sequence given the first two elements.
That is of course fine.
Apart from the real question: if you have so much data that you can't fit it into RAM, you could set up something like memory-mapped files, or simply create virtual memory files (on Linux this is called swap).
I'm writing a program in Python that's processing some data generated during experiments, and it needs to estimate the slope of the data. I've written a piece of code that does this quite nicely, but it's horribly slow (and I'm not very patient). Let me explain how this code works:
1) It grabs a small piece of data of size dx (starting with 3 datapoints)
2) It evaluates whether the difference (i.e. |y(x+dx)-y(x-dx)| ) is larger than a certain minimum value (40x std. dev. of noise)
3) If the difference is large enough, it will calculate the slope using OLS regression. If the difference is too small, it will increase dx and redo the loop with this new dx
4) This continues for all the datapoints
[See updated code further down]
For a datasize of about 100k measurements, this takes about 40 minutes, whereas the rest of the program (it does more processing than just this bit) takes about 10 seconds. I am certain there is a much more efficient way of doing these operations, could you guys please help me out?
Thanks
EDIT:
Ok, so I've got the problem solved by using only binary searches, limiting the number of allowed steps by 200. I thank everyone for their input and I selected the answer that helped me most.
FINAL UPDATED CODE:
def slope(self, data, time):
    (wave1, wave2) = wt.dwt(data, "db3")
    std = 2*np.std(wave2)
    e = std/0.05
    de = 5*std
    N = len(data)
    slopes = np.ones(shape=(N,))
    data2 = np.concatenate((-data[::-1]+2*data[0], data, -data[::-1]+2*data[N-1]))
    time2 = np.concatenate((-time[::-1]+2*time[0], time, -time[::-1]+2*time[N-1]))
    for n in xrange(N+1, 2*N):
        left = N+1
        right = 2*N
        for i in xrange(200):
            mid = int(0.5*(left+right))
            diff = np.abs(data2[n-mid+N]-data2[n+mid-N])
            if diff >= e:
                if diff < e + de:
                    break
                right = mid - 1
                continue
            left = mid + 1
        leftlim = n - mid + N
        rightlim = n + mid - N
        y = data2[leftlim:rightlim:int(0.05*(rightlim-leftlim)+1)]
        x = time2[leftlim:rightlim:int(0.05*(rightlim-leftlim)+1)]
        xavg = np.average(x)
        yavg = np.average(y)
        xlen = len(x)
        slopes[n-N] = (np.dot(x,y)-xavg*yavg*xlen)/(np.dot(x,x)-xavg*xavg*xlen)
    return np.array(slopes)
Your comments suggest that you need to find a better method to estimate i(k+1) given i(k). With no knowledge of the values in data, the naive algorithm is:
At each iteration for n, leave i at its previous value and see if abs(data[start]-data[end]) is less than e. If it is, leave i at its previous value and find your new one by incrementing it by 1, as you do now. If it is greater or equal, do a binary search on i to find the appropriate value. You could possibly do a binary search forwards, but finding a good candidate upper limit without knowledge of data can prove to be difficult. This algorithm won't perform worse than your current estimation method.
If you know that data is reasonably smooth (no sudden jumps, and hence a smooth plot for all i values) and monotonically increasing, you can replace the binary search with a backwards search, decrementing i by 1 instead.
How to optimize this will depend on some properties of your data, but here are some ideas:
Have you tried profiling the code? Using one of the Python profilers can give you some useful information about what's taking the most time. Often, a piece of code you've just written will have one biggest bottleneck, and it's not always obvious which piece it is; profiling lets you figure that out and attack the main bottleneck first.
Do you know what typical values of i are? If you have some idea, you can speed things up by starting with i greater than 0 (as @vhallac noted), or by increasing i by larger amounts - if you often see big values for i, increase i by 2 or 3 at a time; if the distribution of i values has a long tail, try doubling it each time; etc.
Do you need all the data when doing the least squares regression? If that function call is the bottleneck, you may be able to speed it up by using only some of the data in the range. Suppose, for instance, that at a particular point, you need i to be 200 to see a large enough (above-noise) change in the data. But you may not need all 400 points to get a good estimate of the slope — just using 10 or 20 points, evenly spaced in the start:end range, may be sufficient, and might speed up the code a lot.
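A minimal sketch of that last idea (the function name and the choice of 20 points are just illustrative):
import numpy as np

def quick_slope(x, y, start, end, n_points=20):
    """OLS slope over [start, end) using only ~n_points evenly spaced samples."""
    idx = np.linspace(start, end - 1, num=min(n_points, end - start), dtype=int)
    xs = np.asarray(x)[idx]
    ys = np.asarray(y)[idx]
    return np.polyfit(xs, ys, 1)[0]   # slope of the degree-1 fit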
I work with Python for similar analyses and have a few suggestions to make. I didn't look at the details of your code, just at your problem statement:
1) It grabs a small piece of data of size dx (starting with 3
datapoints)
2) It evaluates whether the difference (i.e. |y(x+dx)-y(x-dx)| ) is
larger than a certain minimum value (40x std. dev. of noise)
3) If the difference is large enough, it will calculate the slope
using OLS regression. If the difference is too small, it will increase
dx and redo the loop with this new dx
4) This continues for all the datapoints
I think the most obvious reason for the slow execution is the LOOPING nature of your code, when perhaps you could use the VECTORIZED (array-based operations) nature of Numpy.
For step 1, instead of taking pairs of points, you can directly compute data[3:] - data[:-3] and get all the differences in a single array operation;
For step 2, you can use the result from array-based tests like numpy.argwhere(data > threshold) instead of testing every element inside some loop;
Step 3 sounds conceptually wrong to me. You say that if the difference is too small, it will increase dx. But if the difference is small, the resulting slope would be small because it IS actually small. Then getting a small value is the right result, and artificially increasing dx to get a "better" result might not be what you want. Well, it might actually be what you want, but you should consider this. I would suggest that you calculate the slope for a fixed dx across the whole data, and then take the resulting array of slopes to select your regions of interest (for example, using data_slope[numpy.argwhere(data_slope > minimum_slope)]).
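As a rough sketch of that fixed-dx, vectorized idea (the function name, the default dx, and the minimum_slope parameter are assumptions standing in for the question's values):
import numpy as np

def fixed_dx_slopes(data, time, dx=3, minimum_slope=0.0):
    """Vectorized sketch: finite-difference slopes over a fixed window dx, computed for all positions at once."""
    diffs = data[dx:] - data[:-dx]                      # y(x+dx) - y(x) for every x in one array operation
    slopes = diffs / (time[dx:] - time[:-dx])           # slope per position over the fixed window
    interesting = np.argwhere(np.abs(slopes) > minimum_slope)   # regions of interest selected by slope
    return slopes, interesting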
Hope this helps!