I would like to propagate uncertainty using Python. This is relatively easy for simple functions via the uncertainties package. However, it is not obvious how to achieve the same with a user-defined function. What follows is an example of what I am trying to do.
import mcerp as err
import numpy as np

def mult_func(x, xm, a):
    x[x == 0.] = 1e-20
    v = (1. - (xm/x)**a) * (x > xm)
    v[np.isnan(v)] = 0.
    return v

def intg(e, f, cut, s):
    t = mult_func(e, cut, s)
    res = np.trapz(t*f, e)
    return res
x = np.linspace(0, 1, 10000)
y = np.exp(x)

m = 0.
mm = 0.
N = 100000
for i in range(0, N):
    cut = np.random.normal(0.21, 0.02)
    stg = np.random.normal(1.1, 0.1)
    v = intg(x, y, cut, stg)
    m = m + v
    mm = mm + v*v

print("avg. %10.5E +/- %10.5E fixed %10.5E" % (m/N, np.sqrt(mm/N - (m/N)**2), intg(x, y, 0.21, 1.1)))
What is done above is just random sampling of the two parameters and computing the mean and variance of the results. I am not sure, however, how adequate this brute-force method is. I could use the law of large numbers (Chebyshev's inequality) to estimate how many trials N are needed so that, with probability P = 1 - 1/(N*k**2), the sample mean lies within k standard deviations of the true mean.
In principle what I wrote could work. However, my assumption is that Python, being such a flexible language with many powerful packages, could do this task much more effectively. I was thinking of uncertainties, mcerp and pymc, but due to my limited experience with those packages I am not sure how to proceed.
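For what it's worth, here is a minimal sketch (not part of the original attempt) that draws all parameter samples up front and lets NumPy compute the mean and standard deviation, assuming the same x, y, N and intg defined above:

# Minimal sketch, assuming x, y, N and intg are defined as above.
cuts = np.random.normal(0.21, 0.02, size=N)
stgs = np.random.normal(1.1, 0.1, size=N)
vals = np.array([intg(x, y, c, s) for c, s in zip(cuts, stgs)])
print("avg. %10.5E +/- %10.5E" % (vals.mean(), vals.std()))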
EDIT:
My original example was not very informative, which is why I decided to put together a new example that actually works and illustrates my idea.
NumPy supports arrays of arbitrary numeric types. However, not all functions support such types; in this case neither numpy.exp nor numpy.trapz does.
Note that the uncertainties module contains the unumpy package, which provides a replacement for numpy.exp: uncertainties.unumpy.exp. There is no such replacement for trapz, so we define our own trapezoid rule, utrapz, below.
import numpy as np
import uncertainties as un
import uncertainties.unumpy  # makes un.unumpy accessible

a = un.ufloat(0.3, 0.01)
b = un.ufloat(1.2, 0.071)

def sample_func(a: un.UFloat, b: un.UFloat) -> np.ndarray:
    x = np.linspace(0, a, 100)
    y = un.unumpy.exp(x)
    return utrapz(y, x)

def utrapz(y: np.ndarray, x: np.ndarray) -> np.ndarray:
    # per-interval trapezoid areas (not summed, hence an array is returned)
    Δx = x[1:] - x[:-1]
    avg_y = (y[1:] + y[:-1]) / 2
    return Δx * avg_y

print(sample_func(a, b))
OUT:
[0.00026601240063021264+/-nan 0.0005935120815465686+/-6.429403852670308e-06
0.0006973604419223405+/-3.888235103342809e-06 ...,
0.002095505706899622+/-6.503985178118233e-05
0.0021019968633076134+/-6.545802781649068e-05
0.0021084415802710295+/-6.587387316821736e-05]
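As a small follow-up (not part of the original example), the nominal values and standard deviations of such an array can be pulled apart with the unumpy helpers:

res = sample_func(a, b)
print(un.unumpy.nominal_values(res))  # array of nominal values
print(un.unumpy.std_devs(res))        # array of standard deviations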
I need to find the coefficient of a term in a rather long, nasty expansion. I have a polynomial, say f(x) = (x+x^2)/2 and then a function that is defined recursively: g_k(x,y) = y*f(g_{k-1}(x,y)) with g_0(x,y)=yx.
I want to know, say, the coefficient of x^2*y^4 in g_10(x,y).
I've coded this up as
import sympy

x, y = sympy.symbols('x y')

def f(x):
    return (x + x**2)/2

def g(x, y, k):
    if k == 0:
        return y*x
    else:
        return y*f(g(x, y, k-1))

fxn = g(x, y, 2)
fxn.expand().coeff(x**2).coeff(y**4)
> 1/4
So far so good.
But now I want to find a coefficient for k = 10. Now fxn = g(x,y,10) and then fxn.expand() is very slow. Obviously there are a lot of steps going on, so it's not a surprise. But my knowledge of sympy is rudimentary - I've only started using it specifically because I need to be able to find these coefficients. I could imagine that there may be a way to get sympy to recognize that everything is a polynomial and so it can more quickly find a particular coefficient, but I haven't been able to find examples doing that.
Is there another approach through sympy to get this coefficient, or anything I can do to speed it up?
I assume you are only interested in the coefficients given and not the whole polynomial g(x,y,10). So you can redefine your function g to get rid of higher orders in every step of the recursion. This will significantly speed up your calculation.
def g(x, y, k):
    if k == 0:
        return y*x
    else:
        temp = y*f(g(x, y, k-1)) + sympy.O(y**5) + sympy.O(x**3)
        return temp.expand().removeO()
It works as follows: everything of order O(y**5) or O(x**3) (and higher) is grouped into the Order terms and then discarded by removeO(). Keep in mind that you lose a lot of information this way!
Also have a look here: Sympy: Drop higher order terms in polynomial
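With the truncated recursion above, the k = 10 case should become tractable; a usage sketch (assuming f, x and y are defined as in the question):

fxn = g(x, y, 10)
print(fxn.expand().coeff(x**2).coeff(y**4))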
Intro
There is a pattern that I use all the time in my Python code which analyzes
numerical data. All implementations seem overly redundant or very cumbersome or
just do not play nicely with NumPy functions. I'd like to find a better way to
abstract this pattern.
The Problem / Current State
One method of statistical error propagation is the bootstrap method. It works by
running the same analysis many times with slightly different inputs and looking at
the distribution of the final results.
To compute the actual value of ams_phys, I have the following equation:
ams_phys = (amk_phys**2 - 0.5 * ampi_phys**2) / aB - amcr
All the values that go into that equation have a statistical error associated
with them. These values are in turn computed from other equations. For instance,
amk_phys is computed from this equation, where both numbers again carry
uncertainties:
amk_phys_dist = mk_phys / a_inv
The value of mk_phys is given as (494.2 ± 0.3) in a paper. What I do now is a
parametric bootstrap: I generate R samples from a Gaussian distribution
with mean 494.2 and standard deviation 0.3. This is what I store in
mk_phys_dist:
mk_phys_dist = bootstrap.make_dist(494.2, 0.3, R)
The same is done for a_inv, which is also quoted with an error in the
literature. The above equation is then converted into a list comprehension to yield
a new distribution:
amk_phys_dist = [mk_phys / a_inv
                 for a_inv, mk_phys in zip(a_inv_dist, mk_phys_dist)]
The first equation is then also converted into a list comprehension:
ams_phys_dist = [
    (amk_phys**2 - 0.5 * ampi_phys**2) / aB - amcr
    for ampi_phys, amk_phys, aB, amcr
    in zip(ampi_phys_dist, amk_phys_dist, aB_dist, amcr_dist)]
To get the end result in terms of (Value ± Error), I then take the average and
standard deviation of this distribution of numbers:
ams_phys_val, ams_phys_avg, ams_phys_err \
    = bootstrap.average_and_std_arrays(ams_phys_dist)
The actual value is supposed to be computed from the actual input values, not
from the mean of the bootstrap distribution. I used to have the code duplicated
for that; now I keep the original value at position 0 of the _dist
arrays. The arrays therefore contain 1 + R elements, and the
bootstrap.average_and_std_arrays function separates out that element.
This kind of line occurs for every number that I might want to quote in my
writing. I got annoyed by the writing and created a snippet for it:
$1_val, $1_avg, $1_err = bootstrap.average_and_std_arrays($1_dist)
The need for the snippet strongly told me that I need to do some refactoring.
Also the list comprehensions are always of the following pattern:
foo_dist = [ ... bar ...
             for bar in bar_dist]
It feels bad to write bar three times there.
The Class Approach
I have tried to turn those _dist things into a Boot class such that I would not
have to write ampi_dist and ampi_val but could just use ampi.val, without having
to explicitly call this average_and_std_arrays function and type a bunch of
names for it.
class Boot(object):
    def __init__(self, dist):
        self.dist = dist

    def __str__(self):
        return str(self.dist)

    @property
    def cen(self):
        # original (central) value stored at index 0
        return self.dist[0]

    @property
    def val(self):
        # mean over the R bootstrap samples (index 0 excluded)
        x = np.array(self.dist)
        return np.mean(x[1:, ], axis=0)

    @property
    def err(self):
        # standard deviation over the R bootstrap samples
        x = np.array(self.dist)
        return np.std(x[1:, ], axis=0)
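A quick usage sketch (hypothetical, reusing the amk_phys_dist list built above):

amk_phys = Boot(amk_phys_dist)
print(amk_phys.cen, amk_phys.val, amk_phys.err)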
However, this still does not solve the problem of the list comprehensions. I
fear that I still have to repeat myself there three times. I could make the
Boot object inherit from list, such that I could at least write it like
this (without the _dist):
bar = Boot([... foo ... for foo in foo])
Magic Approach
Ideally all those list comprehensions would be gone such that I could just
write
bar = ... foo ...
where the dots mean some non-trivial operation. That can be simple arithmetic
as above, but it could also be a call to a function that does not support being
called with multiple values at once (the way NumPy functions do).
For instance the scipy.optimize.curve_fit function needs to be called a bunch of times:
popt_dist = [op.curve_fit(linear, mpi, diff)[0]
             for mpi, diff in zip(mpi_dist, diff_dist)]
One would have to write a wrapper for that, because it does not automatically loop over lists of arrays; a sketch of such a wrapper follows.
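As an illustration only (not from the original post), a generic helper along these lines could apply an arbitrary function sample-by-sample to several Boot objects; the names lift, mpi_boot and diff_boot are hypothetical:

def lift(func, *boots):
    # Apply func to the i-th sample of every Boot argument and wrap the
    # results in a new Boot (index 0 stays the original-value entry).
    n = len(boots[0].dist)
    return Boot([func(*(b.dist[i] for b in boots)) for i in range(n)])

# e.g. for the curve_fit case above (hypothetical Boot inputs):
# popt = lift(lambda mpi, diff: op.curve_fit(linear, mpi, diff)[0], mpi_boot, diff_boot)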
Question
Do you see a way to abstract this process of running every transformation with
1 + R sets of data? I would like to get rid of those patterns and the huge
number of variables in each namespace (_dist, _val, _avg, ...), as this
makes passing them to functions rather tedious.
Still I need to have a lot of freedom in the ... foo ... part where I need to
call arbitrary functions.
I really need help, as I am stuck at the beginning of the code.
I was asked to create a function to investigate the exponential distribution with a histogram. The function is x = −log(1−y)/λ. Here λ is a constant, which I refer to as lamdr in the code and simply gave the value 10. I also set N (the number of random numbers) to 10 and ran the code, yet the results computed from the generated random numbers did not match what I expected. Below you can find the code. I don't know what went wrong; I hope you guys can help me! (I use Python 2.)
import random
import math

N = raw_input('How many random numbers you request?: ')
N = int(N)
lamdr = raw_input('Enter a value:')
lamdr = int(lamdr)

def exprand(lamdr):
    y = []
    for i in range(N):
        y.append(random.uniform(0,1))
    return y

y = exprand(lamdr)
print 'Randomly generated numbers:', (y)

x = []
for w in y:
    x.append((math.log((1 - w) / lamdr)) * -1)

print 'Results:', x
After viewing the code you provided, it looks like you have the pieces you need but you're not putting them together.
You were asked to write function exprand(lambdr) using the specified formula. Python already provides a function called random.expovariate(lambd) for generating exponentials, but what the heck, we can still make our own. Your formula requires a "random" value for y which has a uniform distribution between zero and one. The documentation for the random module tells us that random.random() will give us a uniform(0,1) distribution. So all we have to do is replace y in the formula with that function call, and we're in business:
def exprand(lambdr):
    return -math.log(1.0 - random.random()) / lambdr
An historical note: Mathematically, if y has a uniform(0,1) distribution, then so does 1-y. Implementations of the algorithm dating back to the 1950's would often leverage this fact to simplify the calculation to -math.log(random.random()) / lambdr. Mathematically this gives distributionally correct results since P{X = c} = 0 for any continuous random variable X and constant c, but computationally it will blow up in Python for the 1 in 2^64 occurrence where you get a zero from random.random(). One historical basis for doing this was that when computers were many orders of magnitude slower than now, ditching the one additional arithmetic operation was considered worth the minuscule risk. Another was that Prime Modulus Multiplicative PRNGs, which were popular at the time, never yield a zero. These days it's primarily of historical interest, and an interesting example of where math and computing sometimes diverge.
Back to the problem at hand. Now you just have to call that function N times and store the results somewhere. Likely candidates to do so are loops or list comprehensions. Here's an example of the latter:
abuncha_exponentials = [exprand(0.2) for _ in range(5)]
That will create a list of 5 exponentials with λ=0.2. Replace 0.2 and 5 with suitable values provided by the user, and you're in business. Print the list, make a histogram, use it as input to something else...
Replacing exprand with random.expovariate in the list comprehension should produce equivalent results using Python's built-in exponential generator. That's the beauty of functions as an abstraction: once somebody writes them, you can just use them to your heart's content.
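For example, the built-in version of the same list comprehension would read:

abuncha_exponentials = [random.expovariate(0.2) for _ in range(5)]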
Note that because of the use of randomness, this will give different results every time you run it unless you "seed" the random generator to the same value each time.
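For instance, calling the following once at the start of the script makes a run reproducible (the seed value is arbitrary):

random.seed(12345)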
What @pjs wrote is true only up to a point. While the statement "mathematically, if y has a uniform(0,1) distribution, so does 1-y" is correct, the proposal to replace the code with -math.log(random.random()) / lambdr is wrong. Why? Because Python's random module provides U(0,1) on the range [0,1) (as mentioned here), which makes such a replacement non-equivalent.
In more layman's terms: if your U(0,1) actually generates numbers in the [0,1) range, then the code
import random
import math

def exprand(lambdr):  # note: "lambda" is a reserved word in Python
    return -math.log(1.0 - random.random()) / lambdr
is correct, but the code
import random
import math

def exprand(lambdr):
    return -math.log(random.random()) / lambdr
is wrong: it will occasionally blow up with an exception (math domain error), because log(0) will eventually be called when random.random() returns exactly 0.
I regularly find myself needing a random index into an array or a list, where the probabilities of the indices are not uniform but given by certain positive weights. What's a fast way to obtain such an index? I know I can pass weights to numpy.random.choice as the optional argument p, but that function seems quite slow, and building an arange to pass to it is not ideal either. The sum of the weights can be an arbitrary positive number and is not guaranteed to be 1, which rules out the approach of generating a random number in (0,1] and then subtracting weight entries until the result is 0 or less.
While there are answers on how to implement similar things (mostly not about obtaining the array index, but the corresponding element) in a simple manner, such as Weighted choice short and simple, I'm looking for a fast solution, because the appropriate function is executed very often. My weights change frequently, so the overhead of building something like an alias mask (a detailed introduction can be found on http://www.keithschwarz.com/darts-dice-coins/) should be considered part of the calculation time.
Cumulative summing and bisect
In any generic case, it seems advisable to calculate the cumulative sum of weights, and use bisect from the bisect module to find a random point in the resulting sorted array
import bisect
import numpy

def weighted_choice(weights):
    cs = numpy.cumsum(weights)
    return bisect.bisect(cs, numpy.random.random() * cs[-1])
if speed is a concern. A more detailed analysis is given below.
Note: If the array is not flat, numpy.unravel_index can be used to transform a flat index into a shaped index, as seen in https://stackoverflow.com/a/19760118/1274613
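A small usage sketch (the 2-D weight array w is made up) combining weighted_choice with numpy.unravel_index:

w = numpy.array([[1.0, 2.0],
                 [3.0, 4.0]])
flat = weighted_choice(w.ravel())          # index into the flattened array
idx = numpy.unravel_index(flat, w.shape)   # back to a 2-D index, e.g. (1, 0)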
Experimental Analysis
There are four more or less obvious solutions using numpy builtin functions. Comparing all of them using timeit gives the following result:
import timeit

weighted_choice_functions = [
"""import numpy
wc = lambda weights: numpy.random.choice(
    range(len(weights)),
    p=weights/weights.sum())
""",
"""import numpy
# Adapted from https://stackoverflow.com/a/19760118/1274613
def wc(weights):
    cs = numpy.cumsum(weights)
    return cs.searchsorted(numpy.random.random() * cs[-1], 'right')
""",
"""import numpy, bisect
# Using bisect mentioned in https://stackoverflow.com/a/13052108/1274613
def wc(weights):
    cs = numpy.cumsum(weights)
    return bisect.bisect(cs, numpy.random.random() * cs[-1])
""",
"""import numpy
wc = lambda weights: numpy.random.multinomial(
    1,
    weights/weights.sum()).argmax()
"""]

for setup in weighted_choice_functions:
    for ps in ["numpy.ones(40)",
               "numpy.arange(10)",
               "numpy.arange(200)",
               "numpy.arange(199,-1,-1)",
               "numpy.arange(4000)"]:
        print(timeit.timeit("wc(%s)" % ps, setup=setup))
    print()
The resulting output is
178.45797914802097
161.72161589498864
223.53492237901082
224.80936180002755
1901.6298267539823
15.197789980040397
19.985687876993325
20.795070077001583
20.919113760988694
41.6509403079981
14.240949985047337
17.335801470966544
19.433710905024782
19.52205040602712
35.60536142199999
26.6195822560112
20.501282756973524
31.271995796996634
27.20013752405066
243.09768892999273
This means that numpy.random.choice is surprisingly slow, and even the dedicated numpy searchsorted method is slower than the type-naive bisect variant. (These results were obtained using Python 3.3.5 with numpy 1.8.1, so things may be different for other versions.) The function based on numpy.random.multinomial is less efficient for large weights than the methods based on cumulative summing. Presumably the fact that argmax has to iterate over the whole array and run comparisons at each step plays a significant role, as can also be seen from the four-second difference between the increasing and the decreasing weight list.
I would like to numerically integrate a function using multiple CPUs in Python. I would like to do something like this:
from scipy.integrate import quad
import multiprocessing

def FanDDW(arguments):
    wtq, eigq_files, DDB_files, EIGR2D_files, FAN_files = arguments
    ...
    return tot_corr

# Numerical integration
def integration(frequency):
    # Parallelize the work over cpus
    pool = multiprocessing.Pool(processes=nb_cpus)
    total = pool.map(FanDDW, zip(wtq, eigq_files, DDB_files, EIGR2D_files, FAN_files))
    FanDDW_corr = sum(total)
    return quad(FanDDW, -Inf, Inf, args=(zip(wtq, eigq_files, DDB_files, EIGR2D_files, FAN_files)))[0]

vec_functionint = vectorize(integration)
vec_functionint(3, arange(1.0, 4.0, 0.5))
Also "frequency" is a global variable (external to FanDDW(arguments)). It is a vector containing the position where the function must be evaluated. I guess that quad should choose frequency in a clever way. How to pass it to FanDDW knowing that it should NOT be distributed among CPUs and that pool.map does exactly that (it is the reason why I did put it as a global variable and did not pass it to the definition as argument).
Thank you for any help.
Samuel.
All classical quadrature rules have the form ∫ f(x) dx ≈ Σ_i w_i f(x_i), with fixed nodes x_i and weights w_i.
The computation of the f(x_i) is typically the most costly, so if you want to use multiple CPUs, you'll have to think about how to design your f. The sum can be expressed as a scalar product <w, f(x_i)>, and when using numpy.dot for it, it uses threading on most architectures.
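A minimal sketch of that idea (the nodes, weights and integrand here are made up for illustration, not taken from the question):

import numpy as np

x = np.linspace(-1.0, 1.0, 1001)      # nodes of a simple uniform rule
w = np.full_like(x, x[1] - x[0])      # matching weights (rectangle rule)
fx = np.exp(-x**2)                    # evaluate f at all nodes at once
print(np.dot(w, fx))                  # <w, f(x_i)>; BLAS may thread this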
quadpy (a project of mine) calls your integrand with all points at once, so in f you have the chance to get fancy with the computations.
import quadpy
def f(x):
    print(x.shape)  # (1, 50)
    return x[0] ** 2
scheme = quadpy.e1r2.gauss_hermite(50)
val = scheme.integrate(f)
print(val) # 0.886226925452758