I am trying to simulate a simple conditional probability problem. You have two boxes. If you open A you have a 50% chance of winning the prize; if you open B you have a 75% chance of winning. With some simple (bad) Python I have tried the code below, but the appending doesn't work. Any thoughts on a neater way of doing this?
import random
import numpy as np

def liveORdie(prob):
    # Takes an argument of the probability of survival
    live = 0
    for i in range(100):
        if random.random() <= prob*1.0:
            live = 1
    return live

def simulate(n):
    trials = np.array([0])
    for i in range(n):
        if random.random() <= 0.5:
            np.append(trials, liveORdie(0.5))
            print(trials)
        else:
            np.append(trials, liveORdie(0.75))
    return(sum(trials)/n)

simulate(10)
You could make the code tighter by using list comprehensions and numpy's array operations, like so:
import random
import numpy as np

def LiveOrDie():
    prob = 0.5 if random.random() <= 0.5 else 0.75
    return np.sum(np.random.random(100) <= prob)

def simulate(n):
    trials = [LiveOrDie() for x in range(n)]
    return(sum(trials)/n)

simulate(10)
append is a list operation; you're forcing it onto a numpy array, which is not the same thing: np.append returns a new array rather than modifying trials in place, so the appended values are silently discarded. Stick with a regular list, since you're not using anything specific to a numpy array.
def simulate(n):
    trials = []
    for i in range(n):
        if random.random() <= 0.5:
            trials.append(liveORdie(0.5))
Now look at your liveORdie routine. I don't think this is what you want: you loop 100 times to produce a single integer ... and if any one of your trials comes up successful, you return a 1. Since you haven't provided documentation for your algorithms, I'm not sure what you want, but I suspect that it's a list of 100 trials, rather than the conjunction of all 100. You need to append here, as well.
Better yet, run through a tutorial on list comprehension, and use those.
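For example, a sketch of simulate as a single list comprehension (still calling the existing liveORdie, just to show the shape of it):

def simulate(n):
    # pick box A (prob 0.5) or box B (prob 0.75) with equal probability for each trial
    trials = [liveORdie(0.5) if random.random() <= 0.5 else liveORdie(0.75) for _ in range(n)]
    return sum(trials) / n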
The loop in liveORdie() (please consider PEP 8 for naming conventions) will cause the probability of winning to increase: each pass of the loop has a prob chance of winning, and you give it 100 tries, so at 50% (resp. 75%) you are extremely likely to win.
Unless I really misunderstood the problem, you probably just want
def live_or_die(prob):
    return random.random() < prob
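With that fix the simulation should converge to the analytic answer 0.5*0.5 + 0.5*0.75 = 0.625; a quick sketch using it (the large n is just so the estimate settles down):

import random

def simulate(n):
    # choose a box uniformly at random, then make a single draw at that box's win probability
    wins = sum(live_or_die(0.5 if random.random() <= 0.5 else 0.75) for _ in range(n))
    return wins / n

print(simulate(100000))  # hovers around 0.625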
I'm pretty sure that just reduces to:
from numpy import mean
from numpy.random import choice
from scipy.stats import bernoulli
def simulate(n):
    probs = choice([0.5, 0.75], n)
    return 1 - mean(bernoulli.rvs((1 - probs)**100))
which, as others have pointed out, will basically always return 1, since 0.5**100 is ~1e-30.
The quantity to be computed is log(k!), where k could be 4000 or even higher, but of course the log keeps the value manageable. I tried computing sum(log(i)) for i = 1..k, which is the same thing.
So, I am given a large array of integers and I want to efficiently compute that sum for each of them. This was my attempt:
import numpy as np

integers = np.asarray([435, 535, 242])
score = np.sum(np.log(np.arange(1, integers + 1)))
This would work, except that np.arange would generate an array of different size for each integer, so when I run that, it gives me an error (as it should).
The problem could be easily solved with a for loop as follows:
scores = []
for i in range(integers.shape[0]):
    score = np.sum(np.log(np.arange(1, integers[i] + 1)))
    scores.append(score)
but that's too slow. My actual integers array has millions of values to process.
Is there an efficient implementation that doesn't need a for loop? I was thinking of a lambda function or something like that, but I am not really sure how to apply it. Any help is appreciated!
How about math.lgamma? The gamma function generalizes the factorial (n! = Γ(n+1)), and lgamma is the log of gamma, so log(k!) = lgamma(k+1).
You don't need to compute the factorial and then take the log.
There is also gammaln in SciPy.
Code, Python 3.9 x64 Win 10
import numpy as np
from scipy.special import gammaln

startf = 1    # start of factorial sequence
stopf = 400   # end of factorial sequence

q = gammaln(range(startf+1, stopf+1))  # n! = G(n+1)
print(q)
looks reasonable to me
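Applied directly to the integers array from the question, gammaln is vectorized, so (assuming SciPy is available) the per-element log-factorials come out of a single call with no loop:

import numpy as np
from scipy.special import gammaln

integers = np.asarray([435, 535, 242])
scores = gammaln(integers + 1)  # log(k!) = log(Gamma(k+1)) for every k at once
print(scores)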
You can vectorize with something like this:
mi = integers.max()
ls = np.log(np.arange(2, mi + 1))
Two optimizations so far: you only need the range up to the maximum, since the other numbers are covered by that, and you don't need log(1).
Now you take the cumulative sum:
cs = np.cumsum(ls)
The desired elements can be indexed directly:
result = cs[integers - 2]
If this is something you need to do many times, and you know the upper bound, this solution will be much faster than using math.lgamma or scipy.special.gammaln once you precompute cs up to the upper bound.
If this is a one-time call, here is the obligatory one-liner:
np.cumsum(np.log(np.arange(2, np.max(integers) + 1)))[integers - 2]
You can do most of the operations in-place if memory is a concern (I think it also makes them faster):
mi = integers.max()
cs = np.arange(2, mi + 1, dtype=float)   # float dtype so np.log can write into it in-place
np.cumsum(np.log(cs, out=cs), out=cs)
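Putting the pieces together, here is a small sketch of the precompute-once idea as a function (the name log_factorial is just illustrative), checked against scipy.special.gammaln:

import numpy as np
from scipy.special import gammaln

def log_factorial(integers, upper_bound):
    # cumulative sums of log(2), log(3), ..., log(upper_bound); entry k-2 equals log(k!)
    cs = np.cumsum(np.log(np.arange(2, upper_bound + 1)))
    return cs[np.asarray(integers) - 2]  # assumes every entry is >= 2

integers = np.asarray([435, 535, 242])
print(np.allclose(log_factorial(integers, 4000), gammaln(integers + 1)))  # True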
I'm working on statistical mechanics currently, and trying to apply some programming to it since they fit so well together! I'm working on finding the partition function for a finite number of particles. However, the partition function is defined as a sum over a sum! I guess we could write this as a list of a list, so we would use nested for-loops, but I just can't quite figure out the correct way of writing it.
Z = \sum_{s_1, \dots, s_N} e^{s_1 s_2 + \dots + s_{N-1} s_N} is the partition function.
The possible values of s_i are -1 and +1.
Effectively the Ising model (1D) is a chain with N points on it, and each point can have s_i = -1 or +1. The energy of the system depends on the values of s_i, and each possible combination is called a state. The sum over all of these states is called Z, the partition function.
So for a chain of length N=5 (hence 2^5 = 32 possible states), how would I calculate this Z? I don't really have any code to show, but I know from the formula the result should be something like e^(+1+1+1+1+1) + e^(-1+1+1+1+1) + ... + e^(-1-1-1-1-1). The question is: how on earth do I go about doing that? I've generated the set of possible states:
import itertools

counting = 0
for state in itertools.product([1, -1], repeat=5):
    print(state)
    counting += 1
print('the total possible number of states is', counting)
but how can I use this to get a value for Z?
I'd use a function to calculate the sum for each state, then do the overall sum afterwards:
import itertools
from math import exp

def each_state(products):
    for state in products:
        yield sum(state)

Z = sum(exp(x) for x in each_state(itertools.product([1, -1], repeat=5)))
The benefit of this approach is that it is in keeping with the spirit of itertools: to not aggregate everything into memory at once. So while a numpy solution might be faster for small chains, if you wanted to calculate Z over a much larger number of states, a numpy implementation would start to hit memory issues whereas the generator expression will not:
from itertools import product
import numpy as np
from math import exp
# this will yield a single number, and product will yield
# each state one at a time, never aggregating the
# full set of objects into memory (even though it might seem slow)
x = sum(exp(sum(x)) for x in product([1,-1], repeat=500))
# On my 16GB MacBook, this process will be killed because
# we collect all of the states into memory
x = np.array(list(product([1, -1], repeat=500)))
[1] 7743 killed python
The general rule of thumb is that list(giant_iterable) runs out of space whereas for item in giant_iterable will run out of time
Based on your description of the problem, you can calculate it using numpy as follows:
import itertools
import numpy as np
states = np.array([state for state in itertools.product([1,-1], repeat=5)])
print("There are %d states" % states.shape[0]) # 32 states
# calculate the sum for each state
sum_over_each_state = np.sum(states, axis=1)
print(sum_over_each_state)
# calculate e^(sum(state)) for each state
exp_of_all_states = np.exp(sum_over_each_state)
print(exp_of_all_states)
# sum up all exponentials
Z = np.sum(exp_of_all_states)
print("Z:", Z)
This gives Z = 279.96.
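As a sanity check: because the exponent here is just the sum of five independent spins, Z factorizes as (e^{+1} + e^{-1})^5 = (2 cosh 1)^5, which matches the brute-force sum:

from math import cosh

print((2 * cosh(1))**5)  # ~279.96, same as summing over all 32 states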
I really need help as I am stuck at the beginning of the code.
I am asked to create a function to investigate the exponential distribution on a histogram. The function is x = −log(1−y)/λ. λ is a constant which I referred to as lamdr in the code and simply gave the value 10. I gave N (the number of random numbers) the value 10 and ran the code, yet the results and the generated random numbers gave me totally different values; below you can find the code. I don't know what went wrong, hope you guys can help me!! (I use Python 2.)
import random
import math

N = raw_input('How many random numbers you request?: ')
N = int(N)
lamdr = raw_input('Enter a value:')
lamdr = int(lamdr)

def exprand(lamdr):
    y = []
    for i in range(N):
        y.append(random.uniform(0, 1))
    return y

y = exprand(lamdr)
print 'Randomly generated numbers:', (y)

x = []
for w in y:
    x.append((math.log((1 - w) / lamdr)) * -1)
print 'Results:', x
After viewing the code you provided, it looks like you have the pieces you need but you're not putting them together.
You were asked to write function exprand(lambdr) using the specified formula. Python already provides a function called random.expovariate(lambd) for generating exponentials, but what the heck, we can still make our own. Your formula requires a "random" value for y which has a uniform distribution between zero and one. The documentation for the random module tells us that random.random() will give us a uniform(0,1) distribution. So all we have to do is replace y in the formula with that function call, and we're in business:
def exprand(lambdr):
    return -math.log(1.0 - random.random()) / lambdr
An historical note: Mathematically, if y has a uniform(0,1) distribution, then so does 1-y. Implementations of the algorithm dating back to the 1950's would often leverage this fact to simplify the calculation to -math.log(random.random()) / lambdr. Mathematically this gives distributionally correct results since P{X = c} = 0 for any continuous random variable X and constant c, but computationally it will blow up in Python for the 1 in 2^64 occurrence where you get a zero from random.random(). One historical basis for doing this was that when computers were many orders of magnitude slower than now, ditching the one additional arithmetic operation was considered worth the minuscule risk. Another was that Prime Modulus Multiplicative PRNGs, which were popular at the time, never yield a zero. These days it's primarily of historical interest, and an interesting example of where math and computing sometimes diverge.
Back to the problem at hand. Now you just have to call that function N times and store the results somewhere. Likely candidates to do so are loops or list comprehensions. Here's an example of the latter:
abuncha_exponentials = [exprand(0.2) for _ in range(5)]
That will create a list of 5 exponentials with λ=0.2. Replace 0.2 and 5 with suitable values provided by the user, and you're in business. Print the list, make a histogram, use it as input to something else...
Replacing exprand with expovariate in the list comprehension should produce equivalent results using Python's built-in exponential generator. That's the beauty of functions as an abstraction: once somebody writes them you can just use them to your heart's content.
Note that because of the use of randomness, this will give different results every time you run it unless you "seed" the random generator to the same value each time.
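For instance, a minimal way to make runs repeatable (the seed value 12345 is arbitrary):

import random

random.seed(12345)  # same seed => same sequence of pseudo-random draws on every run
abuncha_exponentials = [exprand(0.2) for _ in range(5)]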
What @pjs wrote is true up to a point. While the statement "mathematically, if y has a uniform(0,1) distribution, so does 1-y" is correct, the proposal to replace the code with -math.log(random.random()) / lambdr is not equivalent in practice. Why? Because Python's random module provides U(0,1) in the range [0,1) (as mentioned here), which makes such a replacement non-equivalent.
In more layman's terms: if your U(0,1) actually generates numbers in the [0,1) range, then this code
import math
import random

def exprand(lambdr):
    return -math.log(1.0 - random.random()) / lambdr
is correct, but this code
import math
import random

def exprand(lambdr):
    return -math.log(random.random()) / lambdr
is wrong: it will occasionally raise an exception, because log(0) will be called when random.random() returns exactly 0.
I am trying to come up with a faster way of coding what I want to. Here is the part of my program I am trying to speed up, hopefully using more inbuilt functions:
num = 0
num1 = 0
rand1 = rand_pos[0:10]
time1 = time.clock()
for rand in rand1:
    for gal in gal_pos:
        num1 = dist(gal, rand)
        num = num + num1
time2 = time.clock()
time_elap = time2 - time1
print time_elap
Here, rand_pos and gal_pos are lists of length 900 and 1 million respectively.
Here dist is function where I calculate the distance between two points in euclidean space.
I used a snippet of the rand_pos to get a time measurement.
My time measurements are coming to be about 125 seconds. This is way too long!
It means that if I run the code over all the rand_pos, it will take about three hours to do!
Is there a faster way I can do this?
Here is the dist function:
def dist(pos1, pos2):
    n = 0
    dist_x = pos1[0] - pos2[0]
    dist_y = pos1[1] - pos2[1]
    dist_z = pos1[2] - pos2[2]
    if dist_x < radius and dist_y < radius and dist_z < radius:
        positions = [pos1, pos2]
        distance = scipy.spatial.distance.pdist(positions, metric='euclidean')
        if distance < radius:
            n = 1
    return n
While most of the optimization probably needs to happen within your dist function, there are some tips here to speed things up:
# Don't manually sum
for rand in rand1:
    num += sum([dist(gal, rand) for gal in gal_pos])

# If you can vectorize something, then do
import numpy as np
new_dist = np.vectorize(dist)
for rand in rand1:
    num += np.sum(new_dist(gal_pos, rand))

# Use already-built code whenever possible (as already suggested)
scipy.spatial.distance.cdist(gal_pos, rand1, metric='euclidean')
There is a function in scipy that does exactly what you want to do here:
scipy.spatial.distance.cdist(gal_pos, rand1, metric='euclidean')
It will be faster than anything you write in pure Python probably, since the heavy lifting (looping over the pairwise combinations between arrays) is implemented in C.
Currently your loop is happening in Python, which means there is more overhead per iteration, and on top of that you are making many calls to pdist. Even though pdist is very optimized, the overhead of making so many calls to it slows down your code. This type of performance issue was once described to me with a very useful analogy: it's like trying to have a conversation with someone over the phone by saying one word per phone call. Even though each word goes across the line very fast, your conversation will take a long time because you need to hang up and dial again repeatedly.
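Here is a minimal sketch of how the counting could look with cdist, assuming gal_pos and rand_pos are (N, 3) arrays and radius is defined as in the question; the chunking over rand_pos is only there to avoid building the full 900 x 1,000,000 distance matrix in one go:

import numpy as np
from scipy.spatial.distance import cdist

def count_pairs_within(rand_pos, gal_pos, radius, chunk=100):
    # count (rand, gal) pairs whose Euclidean distance is below radius
    num = 0
    for start in range(0, len(rand_pos), chunk):
        d = cdist(rand_pos[start:start + chunk], gal_pos, metric='euclidean')
        num += np.count_nonzero(d < radius)
    return num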
I have a function which updates the centroid (mean) in a K-means algorithm.
I ran a profiler and noticed that this function uses a lot of computing time.
It looks like:
def updateCentroid(self, label):
    X = []; Y = []
    for point in self.clusters[label].points:
        X.append(point.x)
        Y.append(point.y)
    self.clusters[label].centroid.x = numpy.mean(X)
    self.clusters[label].centroid.y = numpy.mean(Y)
So I ponder, is there a more efficient way to calculate the mean of these points?
If not, is there a more elegant way to formulate it? ;)
EDIT:
Thanks for all great responses!
I was thinking that perhaps I can calculate the mean cumulatively, using something like
x_bar(t) = ((n-1)*x_bar(t-1) + x_n) / n,
where x_bar(t) is the new mean and x_bar(t-1) is the old mean.
Which would result in a function similar to this:
def updateCentroid(self, label):
    cluster = self.clusters[label]
    n = len(cluster.points)
    cluster.centroid.x *= (n-1) / n
    cluster.centroid.x += cluster.points[n-1].x / n
    cluster.centroid.y *= (n-1) / n
    cluster.centroid.y += cluster.points[n-1].y / n
It's not really working, but do you think this could work with some tweaking?
A K-means algorithm is already implemented in scipy.cluster.vq. If there is something about that implementation that you are trying to change, then I'd suggest starting by studying the code there:
In [62]: import scipy.cluster.vq as scv
In [64]: scv.__file__
Out[64]: '/usr/lib/python2.6/dist-packages/scipy/cluster/vq.pyc'
PS. Because the algorithm you posted holds the data behind a dict (self.clusters) and attribute lookup (.points) you are forced to use slow Python looping just to get at your data. A major speed gain could be achieved by sticking with numpy arrays. See the scipy implementation of k-means clustering for ideas on a better data structure.
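For reference, a tiny sketch of the built-in routine (the data here is made up; kmeans returns the centroids and the mean distortion, and vq assigns each point to its nearest centroid):

import numpy as np
import scipy.cluster.vq as scv

points = np.random.rand(100, 2)          # hypothetical (x, y) observations
whitened = scv.whiten(points)            # scale each column to unit variance, as vq expects
centroids, distortion = scv.kmeans(whitened, 3)
labels, _ = scv.vq(whitened, centroids)  # cluster index for every observation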
Why not avoid constructing the extra arrays?
def updateCentroid(self, label):
    sumX = 0; sumY = 0
    N = len(self.clusters[label].points)
    for point in self.clusters[label].points:
        sumX += point.x
        sumY += point.y
    self.clusters[label].centroid.x = sumX / N
    self.clusters[label].centroid.y = sumY / N
The costly part of your function is most certainly the iteration over the points. Avoid it altogether by making self.clusters[label].points a numpy array itself, and then compute the mean directly on it. For example if points contains X and Y coordinates concatenated in a 1D array:
points = self.clusters[label].points
x_mean = numpy.mean(points[0::2])
y_mean = numpy.mean(points[1::2])
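Or, if it is more natural to store the points as an (N, 2) array of (x, y) rows rather than an interleaved 1D array, the same idea becomes:

points = self.clusters[label].points     # assumed to be a numpy array of shape (N, 2)
x_mean, y_mean = points.mean(axis=0)     # column-wise mean gives both coordinates at once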
Without extra lists:
def updateCentroid(self, label):
    self.clusters[label].centroid.x = numpy.fromiter((point.x for point in self.clusters[label].points), dtype=float).mean()
    self.clusters[label].centroid.y = numpy.fromiter((point.y for point in self.clusters[label].points), dtype=float).mean()
Perhaps the added features of numpy's mean are adding a bit of overhead.
>>> def myMean(itr):
...     c = t = 0
...     for item in itr:
...         c += 1
...         t += item
...     return t / c
...
>>> import timeit
>>> a = range(20)
>>> t1 = timeit.Timer("myMean(a)","from __main__ import myMean, a")
>>> t1.timeit()
6.8293311595916748
>>> t2 = timeit.Timer("average(a)","from __main__ import a; from numpy import average")
>>> t2.timeit()
69.697283029556274
>>> t3 = timeit.Timer("average(array(a))","from __main__ import a; from numpy import average, array")
>>> t3.timeit()
51.65147590637207
>>> t4 = timeit.Timer("fromiter(a,npfloat).mean()","from __main__ import a; from numpy import average, fromiter,float as npfloat")
>>> t4.timeit()
18.513712167739868
Looks like numpy's best performance came when using fromiter.
Ok, I figured out a moving average solution which is fast without changing the data structures:
def updateCentroid(self, label):
    cluster = self.clusters[label]
    n = len(cluster.points)
    cluster.centroid.x = ((n-1)*cluster.centroid.x + cluster.points[n-1].x) / n
    cluster.centroid.y = ((n-1)*cluster.centroid.y + cluster.points[n-1].y) / n
This lowered computation time (for the whole k means algorithm) to 13% of original. =)
Thank you all for some great insight!
Try this:
def updateCentroid(self, label):
    self.clusters[label].centroid.x = numpy.array([point.x for point in self.clusters[label].points]).mean()
    self.clusters[label].centroid.y = numpy.array([point.y for point in self.clusters[label].points]).mean()
That's the problem with profilers that only tell you about functions. This is the method I use, and it pinpoints costly lines of code, including points where functions are called.
That said, there's a general point that no data structure is free. As @Michael-Anderson asked, why not avoid making an array? That's the first thing I saw in your code: you're building arrays by appending. You don't need to.
One way to go is to add an x_sum and y_sum to your "clusters" object and sum the coordinates as points are added. If things are moving around, you can also update the sum as points move. Then getting the centroid is just a matter of dividing x_sum and y_sum by the number of points. If your points are numpy vectors that can be added, then you don't even need to sum the components separately: just maintain a sum of all the vectors and multiply by 1/len at the end.
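A minimal sketch of that running-sum idea, with a hypothetical Cluster class standing in for whatever self.clusters[label] actually is (points are assumed to have .x and .y attributes as in the question):

class Cluster:
    def __init__(self):
        self.points = []
        self.x_sum = 0.0
        self.y_sum = 0.0

    def add_point(self, point):
        # keep the coordinate sums up to date as points arrive
        self.points.append(point)
        self.x_sum += point.x
        self.y_sum += point.y

    def centroid(self):
        # the centroid is just the running sums divided by the point count
        n = len(self.points)
        return (self.x_sum / n, self.y_sum / n)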