Intro
There is a pattern that I use all the time in my Python code for analyzing numerical data. Every implementation I have tried seems overly redundant, very cumbersome, or just does not play nicely with NumPy functions. I'd like to find a better way to abstract this pattern.
The Problem / Current State
One method of statistical error propagation is the bootstrap method. It works by
running the same analysis many times with slightly different inputs and looking
at the distribution of final results.
To compute the actual value of ams_phys, I have the following equation:
ams_phys = (amk_phys**2 - 0.5 * ampi_phys**2) / aB - amcr
All the values that go into that equation have a statistical error associated
with them. These values are in turn computed from other equations. For instance,
amk_phys is computed from this equation, where both numbers also have
uncertainties:
amk_phys_dist = mk_phys / a_inv
The value of mk_phys is given as (494.2 ± 0.3) in a paper. What I do now is a
parametric bootstrap: I generate R samples from a Gaussian distribution
with mean 494.2 and standard deviation 0.3. This is what I store in
mk_phys_dist:
mk_phys_dist = bootstrap.make_dist(494.2, 0.3, R)
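(bootstrap is my own helper module; a minimal sketch of what make_dist might look like, assuming it also stores the central value at index 0, a convention explained below:)

import numpy as np

def make_dist(val, err, R):
    # index 0 holds the central value; indices 1..R hold Gaussian samples
    return [val] + list(np.random.normal(val, err, R))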
The same is done for a_inv, which is also quoted with an error in the
literature. The above equation is then converted into a list comprehension to
yield a new distribution:
amk_phys_dist = [mk_phys / a_inv
                 for a_inv, mk_phys in zip(a_inv_dist, mk_phys_dist)]
The first equation is then also converted into a list comprehension:
ams_phys_dist = [
    (amk_phys**2 - 0.5 * ampi_phys**2) / aB - amcr
    for ampi_phys, amk_phys, aB, amcr
    in zip(ampi_phys_dist, amk_phys_dist, aB_dist, amcr_dist)]
To get the end result in terms of (Value ± Error), I then take the average and
standard deviation of this distribution of numbers:
ams_phys_val, ams_phys_avg, ams_phys_err \
    = bootstrap.average_and_std_arrays(ams_phys_dist)
The actual value is supposed to be computed from the actual input values, not
from the mean of the bootstrap distribution. I used to have duplicated code for
that; now I store the original value at the 0th position of the _dist
arrays. The arrays therefore contain 1 + R elements, and the
bootstrap.average_and_std_arrays function separates that element out again.
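For reference, a stripped-down version of that function might look like this:

import numpy as np

def average_and_std_arrays(dist):
    x = np.asarray(dist)
    # central value, bootstrap mean, and bootstrap error estimate
    return x[0], np.mean(x[1:], axis=0), np.std(x[1:], axis=0)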
A line like this occurs for every number that I might want to quote in my
writing. I got annoyed by all the typing and created an editor snippet for it:
$1_val, $1_avg, $1_err = bootstrap.average_and_std_arrays($1_dist)
The need for the snippet strongly suggested that some refactoring was needed.
Also, the list comprehensions always follow this pattern:
foo_dist = [... bar ...
            for bar in bar_dist]
It feels bad to write bar three times there.
The Class Approach
I have tried to turn those _dist things into a Boot class, so that I would not
write ampi_dist and ampi_val but could just use ampi.val, without having
to explicitly call the average_and_std_arrays function and type a bunch of
names for it.
import numpy as np

class Boot(object):
    def __init__(self, dist):
        self.dist = dist

    def __str__(self):
        return str(self.dist)

    @property
    def cen(self):
        return self.dist[0]

    @property
    def val(self):
        x = np.array(self.dist)
        return np.mean(x[1:], axis=0)

    @property
    def err(self):
        x = np.array(self.dist)
        return np.std(x[1:], axis=0)
However, this still does not solve the problem of the list comprehensions; I
fear that I still have to repeat myself three times there. I could make the
Boot object inherit from list, so that I could at least write it like
this (without the _dist):
bar = Boot([... foo ... for foo in foo])
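A minimal sketch of that list-based variant (illustrative only, same conventions as above):

import numpy as np

class Boot(list):
    # the object itself is the distribution, so it can be iterated directly
    @property
    def cen(self):
        return self[0]

    @property
    def val(self):
        return np.mean(np.asarray(self[1:]), axis=0)

    @property
    def err(self):
        return np.std(np.asarray(self[1:]), axis=0)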
Magic Approach
Ideally, all those list comprehensions would be gone, so that I could just
write
bar = ... foo ...
where the dots stand for some non-trivial operation. That can be simple arithmetic
as above, but it could also be a call to a function that does not support
being applied to multiple values at once (as NumPy functions do).
For instance, the scipy.optimize.curve_fit function needs to be called a bunch of times:
popt_dist = [op.curve_fit(linear, mpi, diff)[0]
             for mpi, diff in zip(mpi_dist, diff_dist)]
One would have to write a wrapper for that, because it does not automatically loop over lists of arrays.
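Such a wrapper could be a hypothetical lift helper along these lines (a sketch, not existing code):

def lift(f):
    # turn a function of single samples into a function of whole distributions
    def lifted(*dists):
        return [f(*samples) for samples in zip(*dists)]
    return lifted

# the curve_fit case from above would then become:
fit_linear = lift(lambda mpi, diff: op.curve_fit(linear, mpi, diff)[0])
popt_dist = fit_linear(mpi_dist, diff_dist)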
Question
Do you see a way to abstract this process of running every transformation with
1 + R sets of data? I would like to get rid of those patterns and the huge
number of variables in each namespace (_dist, _val, _avg, ...), as this
makes passing them to functions rather tedious.
Still, I need a lot of freedom in the ... foo ... part, where I need to
call arbitrary functions.
Related
I am trying to implement a Matrix of Complex numbers in Python, but I am stuck at a particular point in the program. I have two modules, Matrix.py and Complex.py, and one test program, test.py. The module implementation is hosted on GitHub at https://github.com/Soumya1234/Math_Repository/tree/dev_branch and my test.py is given below:
from Matrix import *
from Complex import *

C_init = Complex(2, 0)
print C_init
m1 = Matrix(2, 2, C_init)
m1.print_matrix()
C2 = Complex(3, 3)
m1.addValue(1, 1, C2)  # this is where all values of the matrix get changed,
                       # but I want only the (1,1)th value to change to C2
m1.print_matrix()
As mentioned in the comment, addValue(self, i, j, value) in Matrix.py is supposed to change the value at the (i,j)th position only. Then why is the entire matrix getting replaced? What am I doing wrong?
If you don't want to implicitly make copies of init_value, you could also change Matrix.addValue to this:
def addValue(self, i, j, value):
    self.matrix_list[i][j] = value
This is a little more in line with how your Matrix currently works. It's important to remember that a Complex object can't implicitly make a copy of itself; matrix_list actually holds a lot of identical entries (references to one object in memory), so if you modify that object in place, the change shows up everywhere.
Another tip - try to use the __init__ of Complex meaningfully. You could change this kind of thing:
def __sub__(self, complex_object):
    difference = Complex(0, 0)
    difference.real = self.real - complex_object.real
    difference.imag = self.imag - complex_object.imag
    return difference
To this:
def __sub__(self, other):
    return Complex(self.real - other.real,
                   self.imag - other.imag)
This is more concise, doesn't use temporary initialisations or variables, and I find it more readable. It might also benefit you to add some kind of .copy() method to Complex, which returns a new Complex object with the same values.
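A minimal sketch of such a method (using the real and imag attributes already shown above):

def copy(self):
    # return a new, independent Complex with the same values
    return Complex(self.real, self.imag)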
On your methods for string representation - I'd recommend displaying the real and imaginary values as floats, not integers, because they should be real numbers. Here I've rounded them to two decimal places:
def __repr__(self):
    return "%.2f+j%.2f" % (self.real, self.imag)
Note also that you shouldn't actually need __str__ if it does the same thing as __repr__; show seems to be doing roughly the same thing yet again.
Also, in Python there are no private variables, so instead of getReal it's entirely possible to just access .real directly. If you really need getter/setter methods, look into @property.
As you're already doing some overloading, I would also recommend implementing addValue as __setitem__, which is the natural fit for index assignment under Python's data model. If you do this:
def __setitem__(self, inds, value):
    i, j = inds
    self.matrix_list[i][j] = value
You could change the addValue call in test.py to this:
m1[1, 1] = C2
The problem is that in your matrix initialization method you assign the same object C_init to all entries of your matrix. Because you are not just storing its value in each entry but the object itself, the item stored at (0,0) is the very same object as in all other entries, so when you want to change just one entry you change them all.
You have to modify your initialization method like this:
def __init__(self, x, y, init_value):
    self.row = x
    self.column = y
    self.matrix_list = [[Complex(init_value.getReal(), init_value.getComplex())
                         for i in range(y)]
                        for j in range(x)]
This way every entry of your matrix has the same value, but it is no longer a reference to the same object every time.
Furthermore: this is a good example for practice, but if you want to use the Matrix class to compute something, you should rather use NumPy arrays.
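For instance, the whole test.py example shrinks to a few lines with NumPy's built-in complex numbers (a sketch, not using the Complex class):

import numpy as np

m1 = np.full((2, 2), 2 + 0j)  # 2x2 matrix filled with the value 2+0j
m1[1, 1] = 3 + 3j             # only the (1,1)th entry changes
print(m1)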
So I am doing this assignment where I am supposed to minimize the chi-squared function. I saw someone doing this on the internet, so I just copied it:
Multiple variables in SciPy's optimize.minimize
I made a chi-squared function in three variables (x, y, sigma), where sigma is a random Gaussian fluctuation random.gauss(0, sigma). I did not include that code here because at first sight it might be confusing (I used a lot of recursion), but I can assure you that this function is correct.
The code below just builds a list of the calculated minima (which are different every time because of the random Gaussian fluctuation). But here comes the main problem: if I did my calculation correctly, we should get a list with a mean of 2, since I have 2 degrees of freedom, as you can see in this link: https://en.wikipedia.org/wiki/Chi-squared_test.
import scipy.optimize

def Chi2(pos):
    return Chi(pos[0], pos[1], 1)

x_list = []
y_list = []
chi_list = []
for i in range(1000):
    result = scipy.optimize.minimize(Chi2, [5, 5]).x
    x_list.append(result[0])
    y_list.append(result[1])
    chi_list.append(Chi2(result))
But when I use this code, I get a list with a mean of 4, and if I use the method "Powell" instead, I even get a mean of 9!
So my main question is: how is it possible that these means are so different, and how do I know which method gives the best optimization?
Because I think the error might be in my chi-squared function, I will show it as well. The story behind this assignment is that we need to find the position of a mobile device, given routers at the positions (0,0), (20,0), (0,20) and (20,20). We used a lot of recursion, and the graph of the chi-squared function looked fine (it has a minimum at (5,5)).
import random
import numpy as np

# c (speed of light) and f (carrier frequency) are defined elsewhere

def perfectsignal(x_m, y_m, x_r, y_r):
    return 20 * np.log10(c / (4 * np.pi * f)) - 10 * np.log((x_m - x_r)**2 + (y_m - y_r)**2 + 2**2)

def signal(x_m, y_m, x_r, y_r, sigma):
    return perfectsignal(x_m, y_m, x_r, y_r) + random.gauss(0, sigma)

def res(x_m, y_m, x_r, y_r, sigma, sigma2):
    return (signal(x_m, y_m, x_r, y_r, sigma) - perfectsignal(x_m, y_m, x_r, y_r)) / float(sigma2)

def Chi(x, y, sigma):
    return (res(x, y, 0, 0, sigma, 1)**2 + res(x, y, 20, 0, sigma, 1)**2 +
            res(x, y, 0, 20, sigma, 1)**2 + res(x, y, 20, 20, sigma, 1)**2)
I am using Python to solve Project Euler problems. Many require caching the results of past calculations to improve performance, leading to code like this:
pastResults = [None] * 1000000

def someCalculation(integerArgument):
    # return the result of a calculation performed on integerArgument,
    # for example summing the factorial or square of its digits
    ...

for eachNumber in range(1, 1000001):
    if pastResults[eachNumber - 1] is None:
        pastResults[eachNumber - 1] = someCalculation(eachNumber)
    # perform additional actions with pastResults[eachNumber - 1]
Would the repeated decrementing have an adverse impact on program performance? Would having an empty or dummy zeroth element (so the zero-based array emulates a one-based array) improve performance by eliminating the repeated decrementing?
pastResults = [None] * 1000001

def someCalculation(integerArgument):
    # return the result of a calculation performed on integerArgument,
    # for example summing the factorial or square of its digits
    ...

for eachNumber in range(1, 1000001):
    if pastResults[eachNumber] is None:
        pastResults[eachNumber] = someCalculation(eachNumber)
    # perform additional actions with pastResults[eachNumber]
I also feel that emulating a one-based array would make the code easier to follow. That is why I do not make the range zero-based with for eachNumber in range(1000000) as someCalculation(eachNumber + 1) would not be logical.
How significant is the additional memory from the empty zeroth element? What other factors should I consider? I would prefer answers that are not confined to Python and Project Euler.
This is not really an answer to the question regarding the performance, rather a general tip about caching previously calculated values. The usual way to do this is to use a map (a Python dict), as this allows you to use more complex keys than just integer numbers: floating point numbers, strings, or even tuples. Also, you won't run into problems if your keys are sparse.
pastResults = {}

def someCalculation(integerArgument):
    if integerArgument not in pastResults:
        pastResults[integerArgument] = ...  # calculation performed on integerArgument
    return pastResults[integerArgument]
Also, there is no need to perform the calculations "in order" using a loop. Just call the function for the value you are interested in, and the if statement ensures that, when invoked recursively, the function is called only once for each argument.
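For example (a hypothetical use, not from the question), the chain lengths of the Collatz problem can be cached exactly this way, with recursion filling the cache on demand:

collatzLengths = {1: 1}

def chainLength(n):
    # length of the Collatz chain starting at n, caching every intermediate result
    if n not in collatzLengths:
        nxt = n // 2 if n % 2 == 0 else 3 * n + 1
        collatzLengths[n] = 1 + chainLength(nxt)
    return collatzLengths[n]

print(chainLength(27))  # 112 terms, from 27 down to 1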
Ultimately, if you are using this a lot (as is clearly the case for Project Euler), you can define yourself a function decorator, like this one:
def memo(f):
    f.cache = {}
    def _f(*args, **kwargs):
        if args not in f.cache:
            f.cache[args] = f(*args, **kwargs)
        return f.cache[args]
    return _f
What this does is: it takes a function and defines another function that first checks whether the given parameters can be found in the cache, and otherwise calculates the result of the original function and puts it into the cache. Just add the @memo decorator to your function definitions and this will take care of caching for you.
@memo
def someCalculation(integerArgument):
    # function body
This is syntactic sugar for someCalculation = memo(someCalculation). Note, however, that this will not always work out well. First, the parameters have to be hashable (no lists or other mutable types); second, if you pass parameters that are not relevant for the result (e.g., debugging flags), your cache can grow unnecessarily large, as all the parameters are used as the key.
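For illustration, a hypothetical recursive function that becomes fast with the decorator (the recursive calls go through the cached wrapper, because the module-level name fib is rebound by @memo):

@memo
def fib(n):
    # naive recursion would take exponential time without the cache
    return n if n < 2 else fib(n - 1) + fib(n - 2)

print(fib(100))  # 354224848179261915075, computed instantly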
I am trying to convert Matlab code to Python. I want to implement fdesign.lowpass() of Matlab in Python. What would be the exact substitute for this Matlab code using scipy.signal.firwin():
demod_1_a = mod_noisy * 2.*cos(2*pi*Fc*t+phi);
d = fdesign.lowpass('N,Fc', 10, 40, 1600);
Hd = design(d);
y = filter(Hd, demod_1_a);
A very basic approach would be to invoke
import scipy.signal

# spell out the args that were passed to the Matlab function
N = 10
Fc = 40
Fs = 1600

# provide them to firwin
h = scipy.signal.firwin(numtaps=N, cutoff=Fc, nyq=Fs / 2)

# 'x' is the time-series data you are filtering
y = scipy.signal.lfilter(h, 1.0, x)
This should yield a filter similar to the one that ends up being made in the Matlab code, and if your goal is to obtain functionally equivalent results, it should provide a useful filter.
However, if your goal is that the Python code produce exactly the same results, then you'll have to look under the hood of the design call (in Matlab). From my quick check, it's not trivial to parse through the Matlab calls to identify exactly what it is doing, i.e. which design method is used and so on, and how to map that onto corresponding scipy calls. If you really want compatibility, and you only need to do this for a limited number of filters, you could look at the Hd.Numerator field by hand -- this array of numbers corresponds directly to the h variable in the Python code above. So if you copy those numbers into an array by hand, you'll get numerically equivalent results.
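To sanity-check how close the two filters are, one could compare frequency responses (a sketch; h and Fs are the variables from the Python snippet above):

import numpy as np
import scipy.signal
import matplotlib.pyplot as plt

w, H = scipy.signal.freqz(h)  # frequency response of the FIR filter
plt.plot(w / np.pi * (Fs / 2), 20 * np.log10(np.abs(H)))
plt.xlabel("Frequency (Hz)")
plt.ylabel("Magnitude (dB)")
plt.show()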
Write the function sinusoid(a, w, n) that will return a list of ordered pairs representing n cycles of a sinusoid with amplitude a and frequency w. Each cycle should contain 180 ordered pairs.
So far I have:
from math import sin

def sinusoid(a, w, n):
    return [a * sin(x) for x in range(180)]
Please consider the actual functional form of a sinusoidal wave and how the frequency enters the equation. (Hint: http://en.wikipedia.org/wiki/Sine_wave)
I'm not sure what exactly is meant by 'ordered pairs', but I would assume it means (x, y) pairs. Currently you're only returning a list of single values. Also, you might want to take a look at the documentation for Python's sin function.
Okay, we know this is a homework assignment and we're not going to do it for you. However, I'll give you a couple hints.
The instructions:
Write the function sinusoid(a, w, n) that will return a list of ordered pairs representing n cycles of a sinusoid with amplitude a and frequency w. Each cycle should contain 180 ordered pairs.
... translated into a bullet list of requirements:
Write a function
... named sinusoid()
... taking three arguments: a, w, and n
returning a list
... of n cycles(?)
... (each consisting of?) 180 "ordered pairs"
The example you've given does define a function, with the correct name, taking the correct number of arguments. That's a start (not much of one, frankly, but it's something).
The obvious failings are that it doesn't use two of the required arguments and that it doesn't return pairs of anything. It seems that it would return 180 numbers based solely on the argument supplied to its first parameter.
Surely you can do a bit better than that.
Let's start with a stub:
def sinusoid(a, w, n):
    '''Return n cycles of the sinusoid for a given amplitude and frequency,
    where each cycle consists of 180 ordered pairs.
    '''
    results = list()
    # do stuff here
    return results
That's a function, takes three arguments and returns a list. Now for that list to contain anything before we return it we'll have to append some things to it ... and the instructions tell us how many things it should return (n times 180) and what sorts of things they should be (ordered pairs).
That sounds quite a bit like we'll need a loop (for n) and another (for 180). Hmmm ...
That might look like:
for each_cycle in range(n):
    for each_pair in range(180):
        # do something here
        results.append(something)  # where something is a tuple ... an "ordered pair"
... or it might look like:
for each_cycle in range(n):
    this_cycle = list()
    for each_pair in range(180):
        this_cycle.append(something)
    results.extend(this_cycle)
... or it might even look like:
for each_pair in range(n * 180):
    results.append(something)
... though, frankly, that seems unlikely. (If you flatten the inner loop into the outer loop in this way, you might find that you have to use modulo arithmetic to recover n for other intermediate computations.)
I have no idea what the instructor is actually asking for. It seems likely that the math.sin() function will be involved, and I guess "ordered pairs" might be co-ordinates mapped to some sort of graphics subsystem, suitable for plotting a graph; presumably 180 of these show the sinusoid wave through a full range of its values. Maybe you're supposed to multiply something by the amplitude and/or divide something else by the frequency, and maybe you're even supposed to add something for each cycle ... some sort of offset to keep the plot moving towards the right.
But it seems like you might start with that stub of a function definition, try pasting in one or another of these loop bodies, and then figure out how to return meaningful values in the parts where I've used "something" as a placeholder.
Going with the assumption that these "ordered pairs" are co-ordinates for plotting, it seems likely that each of the things you append to your results should be of the form (x, y), where x is monotonically increasing (a fancy way of saying it keeps going up and never goes down) and might even just be range(0, n*180), and y is probably math.sin() of something involving a and w ... but that's just speculation on my part.