I´m currently writing a program in python for calculating the total spectral emissivity (infrared waves) of any given material at different temperatures (200K - 500K), based on measurement data received by measuring the directional - hemispherical emissivity of the material at many different wavelengths using an IR - spectroscope. The calculation is done by integrating the measured intensity over all wavelenghts, using Plancks law as a weighing function (all of this doesn´t really matter for my question itself; i just want to explain the background so that the code is easier to understand). This is my code:
from scipy import integrate
from scipy.interpolate import CubicSpline
import numpy as np
import math as m
def planck_blackbody(lambda_, T): # wavelength, temperature
h = float(6.6260755e-34)
c = float(2.99792458e+8)
k = float(1.380658e-23)
try:
a = 2.0 * h * (c ** 2)
b = h * c / (lambda_ * k * T)
intensity = a / ((lambda_ ** 5) * (m.exp(b) - 1.0))
return float(intensity)
except OverflowError: # for lower temperatures
pass
def spectral_emissivity(emifilename, t, lambda_1, lambda_2):
results = []
with open(emifilename, 'r') as emifile:
emilines = emifile.readlines()
try:
w = [float(x.split('\t')[0].strip('\n')) * 1e-6 for x in emilines]
e = [float(x.split('\t')[1].strip('\n')) for x in emilines]
except ValueError:
pass
w = np.asarray(w) # wavelength
e = np.asarray(e) # measured intensity
def part_1(lambda_, T):
E = interp1d(w, e, fill_value = 'extrapolate')(lambda_)
return E * planck_blackbody(lambda_, T)
def E_complete(T):
E_complete_part_1 = integrate.quad(part_1, lambda_1, lambda_2, args=T, limit=50)
E_complete_part_2 = integrate.quad(planck_blackbody, lambda_1, lambda_2, args=T, limit=50)
return E_complete_part_1[0] / E_complete_part_2[0]
for T in t:
results.append([T, E_complete(T)])
with open("{}.plk".format(emifilename[:-4]), 'w') as resultfile:
for item in results:
resultfile.write("{}\t{}\n".format(item[0], item[1]))
t = np.arange(200, 501, 1)
spectral_emissivity('C:\test.dat', t, 1.4e-6, 35e-6)
The measured intensity is stored in a text file with two columns, the first being the wavelength of the infrared waves and the second being the directional-hemispherical emissivity of the measured material at that wavelength.
When i run this code, while it is producing the right results, i still encounter 2 problems:
I get an error message from scipy.integrate.quad:
IntegrationWarning: The maximum number of subdivisions (50) has been achieved.
If increasing the limit yields no improvement it is advised to analyze
the integrand in order to determine the difficulties. If the position of a
local difficulty can be determined (singularity, discontinuity) one will
probably gain from splitting up the interval and calling the integrator
on the subranges. Perhaps a special-purpose integrator should be used.
warnings.warn(msg, IntegrationWarning)
Can someone explain to me what exactly this means? I understand that integrate.quad is a numerical iteration method and that my functions somehow seem to require more than 50 iterations, but is there a way around this? i tried increasing the Limit, but even with 200 i still get this error message... it´s especially weird given that the integrands are pretty straightforward functions...
is closely connected to the first problem: this program takes ages (about 5 minutes!) to finish one single file, but i need to process many files every hour. cProfile reveals that 98% percent of this time is spent inside the integraion function. A MathCad program doing the exact same thing and producing the same outputs only takes some seconds to finish. Even though i spent the last week seatching for a solution, i simply don´t manage to speed this program up, and no one else on stackoverflow and elsewhere seems to have comparable time problems with integrate.quad.
So, finally, my question: is there any obvious way to optimize this code for it to run faster (except from compiling it into C+ or anything like that)? I tried reducing all floats to 6 digits (i can´t go any lower in accuracy) but that didn´t change anything.
Update: looking into it some more, i figured out that most of the time wasn´t actually consumed by the Integration itself, but by the CubicSpline operation that i used to interpolate my data. I tried out different methods and CubicSpline seemed to be the only working one for some reason (even though my data is monotonically increasing, i got errors from every other method i tried, saying that some values were either above or below the interpolation range). That is, until i found out about extrapolation with scipy.interpolate.interp1dand (fill_value = 'extrapolate'. Ths did the trick for me, enabling me to use the far less consuming interp1d method and effectively reducing the runtime of my program from 280 to 49 seconds (also added list comprehension for w and e). While this is a big improvement, i still wonder why my program takes nearly 1 Minute to calculate some integrals... and i still get the above mentioned IntegrationWarning. So any advice is highly appreciated!
(btw, since i am pretty new to python, I´m happy about any tips or critique i can get!)
Related
I've recently "taught" myself python in order to analyze data for my experiments. As such I'm pretty clueless on many aspects. I've managed to make my analysis work for certain files but in some cases it breaks down and I imagine it is a result of faulty programming.
Currently I export a file containing 3 numpy arrays. One of these arrays is my signal (float values from -10 to 10). What I wish to do is to normalize every datum in this array to a range of values that preceed it. (i.e. the 30001st value must have the average of the preceeding 3000 values subtracted from it and then the difference must then be divided by thisvery same average (the preceeding 3000 values). My data is collected at a rate of 100Hz thus to get a normalization of the alst 30s i must use the preceeding 3000values.
As it stand this is how I've managed to make it work:
this stores the signal into the variable photosignal
photosignal = np.array(seg.analogsignals[0], ndmin=1)
now this the part I use to get the delta F/F over a moving window of 30s
normalizedphotosignal = [(uu-(np.mean(photosignal[uu-3000:uu])))/abs(np.mean(photosignal[uu-3000:uu])) for uu in photosignal[3000:]]
The following adds 3000 values to the beginning to keep the array the same length since later on i must time lock it to another list that is the same length
holder =list(range(3000))
normalizedphotosignal = holder + normalizedphotosignal
What I have noticed is that in certain files this code gives me an error because it says that the"slice" is empty and therefore it cannot create a mean.
I think maybe there is a better way to program this that could avoid this problem altogether. Or this a correct way to approach this problem?
So i tried the solution but it is quite slow and it nevertheless still gives me the "empty slice error".
I went over the moving average post and found this method:
def running_mean(x, N):
cumsum = np.cumsum(np.insert(x, 0, 0))
return (cumsum[N:] - cumsum[:-N]) / N
however I'm having trouble accommodating it to my desired output. namely (x-running average)/running average
Allright so I finally figured it out thanks to your help and the posts you referred me to.
The calculation for my entire data (300 000 +) takes about a second!
I used the following code:
def runningmean(x,N):
cumsum =np.cumsum(np.insert(x,0,0))
return (cumsum[N:] -cumsum[:-N])/N
photosignal = np.array(seg.analogsignal[0], ndmin =1)
photosignalaverage = runningmean(photosignal, 3000)
holder = np.zeros(2999)
photosignalaverage = np.append(holder,photosignalaverage)
detalfsignal = (photosignal-photosignalaverage)/abs(photosignalaverage)
Photosignal stores my raw signal in a numpy array.
Photosignalaverage uses cumsum to calculate the running average of every datapoint in photosignal. I then add the first 2999 values as 0, to maintian the same list size as my photosignal.
I then use basic numpy calculations to get my delta F/F signal.
Thank you once more for the feedback, was truly helpful!
Your approach goes in the right direction. However, you made a mistake in your list comprehension: you are using uu as your index whereas uu are the elements of your input data photosignal.
You want something like this:
normalizedphotosignal2 = np.zeros((photosignal.shape[0]-3000))
for i, uu in enumerate(photosignal[3000:]):
normalizedphotosignal2 = (uu - (np.mean(photosignal[i-3000:i]))) / abs(np.mean(photosignal[i-3000:i]))
Keep in mind that for-loops are relatively slow in python. If performance is an issue here, you could try avoiding the for loop and use numpy methods instead (e.g. have a look at Moving average or running mean).
Hope this helps.
I'm performing an ADF test on several(~500) time series to test stationarity. So I need a quantitative way of choosing the correct number of lags for each one of them. One possible approach is to use, say, 80% of my sample for the test and to get the parameters of the regression in it and compute the ssr (sum of squares regression) and search for the minimum. However, this may lead to over fitting and in order to avoid it, this regression can then be applied to the remaining 20% and calculate the ssr of this sub-sample. The number of lags that lead to the minimum value of this second ssr, should be the correct one.
The problem is that statsmodels documentation is less than incomplete (at least for a newbie like me!). For example, given the line
res = ts.adfuller(dUs, maxlag=max_lag_, autolag=None, regression='ct', store=True, regresults=True)
the regression coefficients are stored in res[3].resols.params but the order is unknown. I had to ask someone to run the test on one of my time series in R (which gives you the used formula and the corresponding coefficients like this
R-output).
Python order of parameters seems to be (for a 'ct' regression) lag 1, lag diff 1, lag diff 2, ...lag diff N, intercept, time trend. I, then, re-construct the fitted series with the following code:
xFit[0:max_lag_ + 1] = dUs[0:max_lag_ + 1]
for i in range (max_lag_ + 1,xFit.size):
xFit[i] = xFit[i-1] + res[3].resols.params[0] * xFit[i-1] + res[3].resols.params[res[3].resols.params.size - 2] + res[3].resols.params[res[3].resols.params.size - 1] * t[i]
for j in range(1,max_lag_ +1):
xFit[i] = xFit[i] + res[3].resols.params[j] * lag[i-1-j]
Note that the lag variable is constructed from my dUs variable like this
lag = dUs[1:]-dUs[:-1]
The thing is that the xFit series and res[3].resols.fittedvalues are different! I think that it might has something to do with my initialization of the first max_lag_ data points (in fact, note that the res[3].resols.fittedvalues is max_lag_ + 1 shorter than the original series): I chose them to be equal to the original series. But I can' t figure out what exactly is going on. The difference between xFit and res[3].resols.params is HUGE:
time-series-comparison. Note also that, increasing the lag number makes my fitting better up to some value, and then the series explodes. This does not happen with fittedvalues!
As a final test, I ran the ADF test on xFit; I understand this should lead to the res[3].resols.params I already got.
Given the line
res2 = ts.adfuller(xFit, maxlag=max_lag_, autolag=None, regression='ct', store=True, regresults=True)
the output of res2[3].resols.params is
[ -1.60231256e+00 4.23814175e-02 -4.15837300e-02 4.99642618e-02
-6.92483339e+02 3.89141878e+00]
while res[3].resols.params is
[ -1.29269094e+00 2.11857016e-02 -5.82679110e-02 -2.09614163e-02
-5.44413351e+02 2.69502722e+00]
I know that many of you would suggest to move to R but, a) I never used it (although I could learn) and b) getting software installed at work is not that easy and it could take me a lot of precious time.
Any ideas? any mistake I'm missing?
Thanks in advance,
C
I solved the issue (although I didn't had time to post it before!).
The thing is that R fits the difference series while Python fits the time series itself. That in combination with an error (xFit should be replaced with dUs in the reconstruction of the time series!) made everything weird as explained above.
The right code is
for i in range (max_lag_ + 1,xFit.size):
xFit[i] = res[3].resols.params[0] * dUs[i-1] + res[3].resols.params[res[3].resols.params.size - 1]
for j in range(1,max_lag_ +1):
xFit[i] = xFit[i] + res[3].resols.params[j] * lag[i-1-j]
I really need help as I am stuck at the begining of the code.
I am asked to create a function to investigate the exponential distribution on histogram. The function is x = −log(1−y)/λ. λ is a constant and I referred to that as lamdr in the code and simply gave it 10. I gave N (the number of random numbers) 10 and ran the code yet the results and the generated random numbers gave me totally different results; below you can find the code, I don't know what went wrong, hope you guys can help me!! (I use python 2)
import random
import math
N = raw_input('How many random numbers you request?: ')
N = int(N)
lamdr = raw_input('Enter a value:')
lamdr = int(lamdr)
def exprand(lamdr):
y = []
for i in range(N):
y.append(random.uniform(0,1))
return y
y = exprand(lamdr)
print 'Randomly generated numbers:', (y)
x = []
for w in y:
x.append((math.log((1 - w) / lamdr)) * -1)
print 'Results:', x
After viewing the code you provided, it looks like you have the pieces you need but you're not putting them together.
You were asked to write function exprand(lambdr) using the specified formula. Python already provides a function called random.expovariate(lambd) for generating exponentials, but what the heck, we can still make our own. Your formula requires a "random" value for y which has a uniform distribution between zero and one. The documentation for the random module tells us that random.random() will give us a uniform(0,1) distribution. So all we have to do is replace y in the formula with that function call, and we're in business:
def exprand(lambdr):
return -math.log(1.0 - random.random()) / lambdr
An historical note: Mathematically, if y has a uniform(0,1) distribution, then so does 1-y. Implementations of the algorithm dating back to the 1950's would often leverage this fact to simplify the calculation to -math.log(random.random()) / lambdr. Mathematically this gives distributionally correct results since P{X = c} = 0 for any continuous random variable X and constant c, but computationally it will blow up in Python for the 1 in 264 occurrence where you get a zero from random.random(). One historical basis for doing this was that when computers were many orders of magnitude slower than now, ditching the one additional arithmetic operation was considered worth the minuscule risk. Another was that Prime Modulus Multiplicative PRNGs, which were popular at the time, never yield a zero. These days it's primarily of historical interest, and an interesting example of where math and computing sometimes diverge.
Back to the problem at hand. Now you just have to call that function N times and store the results somewhere. Likely candidates to do so are loops or list comprehensions. Here's an example of the latter:
abuncha_exponentials = [exprand(0.2) for _ in range(5)]
That will create a list of 5 exponentials with λ=0.2. Replace 0.2 and 5 with suitable values provided by the user, and you're in business. Print the list, make a histogram, use it as input to something else...
Replacing exporand with expovariate in the list comprehension should produce equivalent results using Python's built-in exponential generator. That's the beauty of functions as an abstraction, once somebody writes them you can just use them to your heart's content.
Note that because of the use of randomness, this will give different results every time you run it unless you "seed" the random generator to the same value each time.
WHat #pjs wrote is true to a point. While statement mathematically, if y has a uniform(0,1) distribution, so does 1-y appears to be correct, proposal to replace code with -math.log(random.random()) / lambdr is just wrong. Why? Because Python random module provide U(0,1) in the range [0,1) (as mentioned here), thus making such replacement non-equivalent.
In more layman term, if your U(0,1) is actually generating numbers in the [0,1) range, then code
import random
def exprand(lambda):
return -math.log(1.0 - random.random()) / lambda
is correct, but code
import random
def exprand(lambda):
return -math.log(random.random()) / lambda
is wrong, it will sometimes generate NaN/exception, as log(0) will be called
So i am doing this assignment, where i am supposed to minimize the chi squared function. I saw someone doing this on the internet so i just copied it:
Multiple variables in SciPy's optimize.minimize
I made a chi-squared function which is a function in 3 variables (x,y,sigma) where sigma is random gaussian fluctuation random.gauss(0,sigma). I did not print that code here because on first sight it might be confusing (I used a lot of recursion). But i can assure you that this function is correct.
now this code just makes a list of the calculated minimization(Which are different every time because of the random gaussian fluctuation). But here comes the main problem. If i did my calculation correctly, we should get a list with a mean of 2 (since i have 2 degrees of freedom as you can see in this link: https://en.wikipedia.org/wiki/Chi-squared_test).
def Chi2(pos):
return Chi(pos[0],pos[1],1)
x_list= []
y_list= []
chi_list = []
for i in range(1000):
result = scipy.optimize.minimize(Chi2,[5,5]).x
x_list.append(result[0])
y_list.append(result[1])
chi_list.append(Chi2(result))
But when i use this code i get a list of mean 4, however if i add the method "Powell" i get a mean of 9!!
So my main question is, how is it possible these means are so different and how do i know which method to use to get the best optimization?
Because i think the error might be in my chisquare function i will show this one as well. The story behind this assignment is that we need to find the position of a mobile device and we have routers on the positions (0,0),(20,0),(0,20) and (20,20). We used a lot of recursion, and the graph of the chi_squared looked fine(it has a minimum on (5,5)
def perfectsignal(x_m,y_m,x_r,y_r):
return 20*np.log10(c / (4 * np.pi * f)) - 10 * np.log((x_m-x_r)**2 + (y_m-y_r)**2 + 2**2)
def signal(x_m,y_m,x_r,y_r,sigma):
return perfectsignal(x_m,y_m,x_r,y_r) + random.gauss(0,sigma)
def res(x_m,y_m,x_r,y_r,sigma,sigma2):
x = (signal(x_m,y_m,x_r,y_r,sigma) - perfectsignal(x_m,y_m,x_r,y_r))/float(sigma2);
return x
def Chi(x,y,sigma):
return(res(x,y,0,0,sigma,1)**2+res(x,y,20,0,sigma,1)**2+res(x,y,0,20,sigma,1)**2+res(x,y,20,20,sigma,1)**2)
Kees
I'm writing a program in Python that's processing some data generated during experiments, and it needs to estimate the slope of the data. I've written a piece of code that does this quite nicely, but it's horribly slow (and I'm not very patient). Let me explain how this code works:
1) It grabs a small piece of data of size dx (starting with 3 datapoints)
2) It evaluates whether the difference (i.e. |y(x+dx)-y(x-dx)| ) is larger than a certain minimum value (40x std. dev. of noise)
3) If the difference is large enough, it will calculate the slope using OLS regression. If the difference is too small, it will increase dx and redo the loop with this new dx
4) This continues for all the datapoints
[See updated code further down]
For a datasize of about 100k measurements, this takes about 40 minutes, whereas the rest of the program (it does more processing than just this bit) takes about 10 seconds. I am certain there is a much more efficient way of doing these operations, could you guys please help me out?
Thanks
EDIT:
Ok, so I've got the problem solved by using only binary searches, limiting the number of allowed steps by 200. I thank everyone for their input and I selected the answer that helped me most.
FINAL UPDATED CODE:
def slope(self, data, time):
(wave1, wave2) = wt.dwt(data, "db3")
std = 2*np.std(wave2)
e = std/0.05
de = 5*std
N = len(data)
slopes = np.ones(shape=(N,))
data2 = np.concatenate((-data[::-1]+2*data[0], data, -data[::-1]+2*data[N-1]))
time2 = np.concatenate((-time[::-1]+2*time[0], time, -time[::-1]+2*time[N-1]))
for n in xrange(N+1, 2*N):
left = N+1
right = 2*N
for i in xrange(200):
mid = int(0.5*(left+right))
diff = np.abs(data2[n-mid+N]-data2[n+mid-N])
if diff >= e:
if diff < e + de:
break
right = mid - 1
continue
left = mid + 1
leftlim = n - mid + N
rightlim = n + mid - N
y = data2[leftlim:rightlim:int(0.05*(rightlim-leftlim)+1)]
x = time2[leftlim:rightlim:int(0.05*(rightlim-leftlim)+1)]
xavg = np.average(x)
yavg = np.average(y)
xlen = len(x)
slopes[n-N] = (np.dot(x,y)-xavg*yavg*xlen)/(np.dot(x,x)-xavg*xavg*xlen)
return np.array(slopes)
Your comments suggest that you need to find a better method to estimate ik+1 given ik. No knowledge of values in data would yield to the naive algorithm:
At each iteration for n, leave i at previous value, and see if the abs(data[start]-data[end]) value is less than e. If it is, leave i at its previous value, and find your new one by incrementing it by 1 as you do now. If it is greater, or equal, do a binary search on i to find the appropriate value. You can possibly do a binary search forwards, but finding a good candidate upper limit without knowledge of data can prove to be difficult. This algorithm won't perform worse than your current estimation method.
If you know that data is kind of smooth (no sudden jumps, and hence a smooth plot for all i values) and monotonically increasing, you can replace the binary search with a search backwards by decrementing its value by 1 instead.
How to optimize this will depend on some properties of your data, but here are some ideas:
Have you tried profiling the code? Using one of the Python profilers can give you some useful information about what's taking the most time. Often, a piece of code you've just written will have one biggest bottleneck, and it's not always obvious which piece it is; profiling lets you figure that out and attack the main bottleneck first.
Do you know what typical values of i are? If you have some idea, you can speed things up by starting with i greater than 0 (as #vhallac noted), or by increasing i by larger amounts — if you often see big values for i, increase i by 2 or 3 at a time; if the distribution of is has a long tail, try doubling it each time; etc.
Do you need all the data when doing the least squares regression? If that function call is the bottleneck, you may be able to speed it up by using only some of the data in the range. Suppose, for instance, that at a particular point, you need i to be 200 to see a large enough (above-noise) change in the data. But you may not need all 400 points to get a good estimate of the slope — just using 10 or 20 points, evenly spaced in the start:end range, may be sufficient, and might speed up the code a lot.
I work with Python for similar analyses, and have a few suggestions to make. I didn't look at the details of your code, just to your problem statement:
1) It grabs a small piece of data of size dx (starting with 3
datapoints)
2) It evaluates whether the difference (i.e. |y(x+dx)-y(x-dx)| ) is
larger than a certain minimum value (40x std. dev. of noise)
3) If the difference is large enough, it will calculate the slope
using OLS regression. If the difference is too small, it will increase
dx and redo the loop with this new dx
4) This continues for all the datapoints
I think the more obvious reason for slow execution is the LOOPING nature of your code, when perhaps you could use the VECTORIZED (array-based operations) nature of Numpy.
For step 1, instead of taking pairs of points, you can perform directly `data[3:] - data[-3:] and get all the differences in a single array operation;
For step 2, you can use the result from array-based tests like numpy.argwhere(data > threshold) instead of testing every element inside some loop;
Step 3 sounds conceptually wrong to me. You say that if the difference is too small, it will increase dx. But if the difference is small, the resulting slope would be small because it IS actually small. Then, getting a small value is the right result, and artificially increasing dx to get a "better" result might not be what you want. Well, it might actually be what you want, but you should consider this. I would suggest that you calculate the slope for a fixed dx across the whole data, and then take the resulting array of slopes to select your regions of interest (for example, using data_slope[numpy.argwhere(data_slope > minimum_slope)].
Hope this helps!