I am using Python to sum random star spectra to increase their signal-to-noise ratio. One of the keywords in each spectrum's header contains the integration time for that spectrum. When I sum the spectra, I want the keyword of the resulting spectrum to be updated with the sum of the integration times of the spectra I used. For that, I use the following code:
for kk in range(0, NumberOfSpectra):  # main cycle
    TotalIntegrationTime = 0.0
    for item in RandomSpectraList:  # secondary cycle
        SpectrumHeader = SpectraFullList[item]['head']  #1
        TotalIntegrationTime += SpectrumHeader['EXPTIME']
    SpectrumHeader['EXPTIME'] = TotalIntegrationTime  #2
    SaveHeaderFunction(SpectrumHeader, kk)
The problem I am having is that when the main cycle loops, SpectrumHeader does not get reset when I reassign it at #1: it still shows the value it had at #2. Any ideas on why this happens and how to fix it?
NumberOfSpectra is provided by the user, RandomSpectraList is a list of random spectra by name. SpectraFullList contains the spectra and has keys 'head' and 'spec'.
Are you aware of the fact that during line #2, SpectrumHeader still points to an element of SpectraFullList? They are really the same object.
So, when executing line #2, you are essentially modifying SpectraFullList.
I guess that is not what you want, and it may be the cause of your problem.
In order to solve it, insert the following line before #2:
SpectrumHeader = SpectraFullList[item]['head'].copy()
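To see the aliasing and the fix in isolation, here is a minimal sketch with plain dicts standing in for the headers (all the values below are hypothetical stand-ins for the question's variables):

# Hypothetical stand-ins for the question's data
SpectraFullList = {'a': {'head': {'EXPTIME': 10.0}},
                   'b': {'head': {'EXPTIME': 20.0}}}
RandomSpectraList = ['a', 'b']

TotalIntegrationTime = 0.0
for item in RandomSpectraList:
    TotalIntegrationTime += SpectraFullList[item]['head']['EXPTIME']

# .copy() gives an independent header, so writing EXPTIME (#2)
# no longer reaches back into SpectraFullList
SpectrumHeader = SpectraFullList[item]['head'].copy()
SpectrumHeader['EXPTIME'] = TotalIntegrationTime

print(SpectrumHeader['EXPTIME'])                # 30.0
print(SpectraFullList['b']['head']['EXPTIME'])  # still 20.0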
I'm trying to create a Monte Carlo simulation to simulate future stock prices using NumPy arrays.
My current approach is: create a for loop which fills an array, stock_price_array, with simulated stock prices. These stock prices are generated by taking the last stock price and multiplying it by (1 + an annual return). The annual returns are drawn randomly from a normal distribution and stored in the array annual_ret.
My problem is that although the "stock price" variables I print from my for loop appear to be correct, I simply cannot figure out how to append these stock price variables to stock_price_array.
I've tried various methods, including initializing the stock_price_array using .full instead of .empty, changing the order of where the array appears in the For Loop, and checking the size of the array.
I've read other Stack Overflow posts on similar topics but can't figure out what I'm doing wrong.
Thank you in advance for your help!
annual_mean = .06
annual_stdev = .15
start_stock_price = 100
numYears = 3
numSimulations = 4
stock_price_array = np.empty(numYears)
# draw an annual return from a normal distribution; this annual return will be random
annual_ret = np.random.normal(annual_mean, annual_stdev, numSimulations)
for i in range(numYears):
    stock_price = np.multiply(start_stock_price, (1 + annual_ret[i]))
    np.append(stock_price_array, [stock_price])
    start_stock_price = stock_price
The first rule of numpy is: never iterate over your array yourself. Use numpy functions that do all the computation in batch (to do so, they iterate over the array too, sure, but that iteration is not a Python iteration, so it is far faster).
No-for solution
For example, here, you could do something like this:
np.cumprod(np.hstack([start_stock_price, annual_ret+1]))
What it does is first build an array of the initial value followed by the growth factors.
So if the initial value is 100 and the interest rates are 0.1, -0.1, 0.2, 0.2 (for example), then hstack builds an array of the values 100, 1.1, 0.9, 1.2, 1.2.
And cumprod just builds the cumulative product of those:
100, 100×1.1=110, 110×0.9=99, 99×1.2=118.8, 118.8×1.2=142.56
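A runnable version of that one-liner, with those example factors hard-coded so the output can be checked by hand:

import numpy as np

start_stock_price = 100
factors = np.array([1.1, 0.9, 1.2, 1.2])  # the (1 + annual_ret) values from the example
prices = np.cumprod(np.hstack([start_stock_price, factors]))
print(prices)  # [100. 110. 99. 118.8 142.56]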
Correcting your code
To answer your initial question anyway (even though I strongly advise that you try solutions like the cumprod one shown above), you have 2 choices:
Either you allocate an array in advance, as you did (your stock_price_array = np.empty(numYears)). Then, instead of trying to append the new stock_price to stock_price_array, you simply fill one of the empty places that is already there, by doing stock_price_array[i] = stock_price.
Or you don't. Then you replace the np.empty line with stock_price_array = []. And at each step you append the result to create a new stock_price_array, like this: stock_price_array = np.append(stock_price_array, [stock_price]).
I strongly advise against the 2nd solution. Since you already know the final size of the array, it is far better to create it once. np.append creates a brand-new array and then copies the input data into it; it does not just extend the existing array (generally speaking, that cannot be done anyway).
But, well, I advise against both solutions anyway, since I find mine (with cumprod) preferable. for is the taboo word in numpy. And it is even more so when what is inside the for is the creation of a new array, as with append.
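For completeness, a minimal sketch of the 1st choice (pre-allocate, then fill). Note that I draw numYears returns here rather than numSimulations; the reason is discussed in the Monte-Carlo section below:

import numpy as np

start_stock_price = 100
numYears = 3
annual_ret = np.random.normal(0.06, 0.15, numYears)

stock_price_array = np.empty(numYears)
stock_price = start_stock_price
for i in range(numYears):
    stock_price = stock_price * (1 + annual_ret[i])
    stock_price_array[i] = stock_price  # fill the pre-allocated slot instead of appending
print(stock_price_array)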
Monte-Carlo
Since you've mentioned Monte Carlo, and then shown code that computes only one result (you draw 1 set of annual returns and perform one computation of future values), I wonder if that is really what you want.
In particular, I see that you have numSimulations and numYears, which appear to play redundant roles in your code (and therefore in mine).
The only reason it doesn't just throw an index error is that numSimulations is used only to decide how many annual_ret values you draw. And since numSimulations > numYears, you have more than enough annual_ret values to compute the result.
Wasn't your initial intention to redo the simulation numSimulations times, to get numSimulations results?
In that case, you probably need numSimulations sets of numYears annual rates, so a 2D array. And likewise, you should be computing numSimulations series of numYears results.
If my guess is not completely off, I surmise that what you really wanted to do was something to the effect of:
annual_ret = np.random.normal(annual_mean, annual_stdev, (numSimulations, numYears)) # 2d array of interest rate. 1 simulation per row, 1 year per column
t = np.pad(annual_ret+1, ((0,0), (1,0)), constant_values=start_stock_price) # Add 1 as we did earlier. And pad with an initial 100 (`start_stock_price`) at the beginning of each simulation
res = np.cumprod(t, axis=1) # cumulative multiplication. `axis=1` means that it is done along axis 1 (along years) for each row (for each simulation)
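With the question's values, res then has shape (numSimulations, numYears + 1): one row per simulation, with the start price in column 0. For instance:

print(res.shape)           # (4, 4) here: numSimulations rows, numYears + 1 columns
final_prices = res[:, -1]  # the price after numYears, one value per simulation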
I have an array D of variable length.
I want to create a loop that performs a sum based on the value of D corresponding to the number of times looped,
i.e. the 5th run through the loop would use the 5th value in my array.
My code is:
period = 63 # can be edited to an input() command for variable periods.
Mrgn_dec = .10 # decimal value of 10%, can be manipulated to produce a 10% increase/decrease
rtn_annual = np.arange(0.00,0.15,0.05) # creates an array ??? not sure if helpful
sig_annual = np.arange(0.01,0.31,0.01) # use .31 as python doesn't include the upper range value.
#functions for variables of daily return and risk.
rtn_daily = (1/252)*rtn_annual
sig_daily = (1/(np.sqrt(252)))*sig_annual
D=np.random.normal(size=period) # unsure of range to use for standard distribution
for i in range(period):
    r = (rtn_daily + sig_daily*D)
I'm trying to make it so that, on each pass, my for loop multiplies by the corresponding value of D.
So D has a random value for every value of period, where period represents a day.
So for the 8th day I want the loop value for r to be multiplied by the 8th value in my array. Is there a way to select that specific value?
Does the numpy.cumprod command offer any help? I'm not sure how it works, but it has been suggested for this problem.
You can select an element of an iterable object (such as D in your code) simply by its index, such as:
for i in range(period):
    print D[i]
But in your code, rtn_daily and sig_daily do not have the same shape. I assume that you want to add sig_daily multiplied by D[i] at each position of rtn_daily, so try this:
# -*- coding:utf-8 -*-
import numpy as np
period = 63 # can be edited to an input() command for variable periods.
Mrgn_dec = .10 # decimal value of 10%, can be manipulated to produce a 10% increase/decrease
rtn_annual = np.repeat(np.arange(0.00,0.15,0.05), 31) # creates an array ??? not sure if helpful
sig_annual = np.repeat(np.arange(0.01,0.31,0.01), 3) # use .31 as python doesn't include the upper range value.
#functions for variables of daily return and risk.
rtn_daily = (float(1)/252)*rtn_annual
sig_daily = (1/(np.sqrt(252)))*sig_annual
D=np.random.normal(size=period) # unsure of range to use for standard distribution
print D
for i in range(period):
    r = (rtn_daily[i] + sig_daily[i]*D[i])
    print r
Last of all, if you are using Python 2, division between integers is integer division, which means 1/252 will give you zero as a result:
a = 1/252  # -> 0
To solve this you can make it a float:
rtn_daily = (float(1)/252)*rtn_annual
Right now, D is just a scalar.
I'd suggest reading https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.random.normal.html to learn about the parameters.
If you change it to:
D=np.random.normal(mean,stdev,period)
you will get a 1D array of period samples, where mean and stdev are the mean and standard deviation of the distribution. Then change the loop to:
for i in range(period):
    r = (rtn_daily + sig_daily*D[i])
EDIT: I don't know what I was thinking when I read the code the first time. It was a horribly bad read on my part.
Looking back at the code, a few things need to happen to make it work.
First:
rtn_annual = np.arange(0.00,0.15,0.05)
sig_annual = np.arange(0.01,0.31,0.01)
These two lines need to be fixed so that the resulting arrays have the same dimensions.
Then:
rtn_daily = (1/252)*rtn_annual
Needs to be changed so it doesn't zero everything out: either change 1 to 1.0 or use float(1).
Finally:
r=(rtn_daily+sig_daily*D)
needs to be changed to:
r=(rtn_daily+sig_daily*D[i])
I'm not really sure of the intent of the original code, but it appears as though the loop is unnecessary and you could just change the loop to:
r=(rtn_daily+sig_daily*D[day])
where day is the day you're trying to isolate.
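Putting those fixes together, a minimal runnable sketch. How to make the shapes match is my assumption here: I simply give every array one value per day of the period, which may or may not match the original intent:

import numpy as np

period = 63
rtn_annual = np.full(period, 0.05)            # assumed: one annual return per day
sig_annual = np.linspace(0.01, 0.30, period)  # assumed: one annual sigma per day

rtn_daily = (1.0/252)*rtn_annual              # 1.0 rather than 1, to avoid integer division
sig_daily = (1/(np.sqrt(252)))*sig_annual
D = np.random.normal(size=period)             # one standard-normal draw per day

r = rtn_daily + sig_daily*D                   # elementwise: day i uses D[i], no loop needed
print(r[7])                                   # the value for the 8th day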
I've recently "taught" myself Python in order to analyze data for my experiments. As such I'm pretty clueless about many aspects. I've managed to make my analysis work for certain files, but in some cases it breaks down, and I imagine it is a result of faulty programming.
Currently I export a file containing 3 numpy arrays. One of these arrays is my signal (float values from -10 to 10). What I wish to do is normalize every datum in this array to the range of values that precede it (i.e. the 30001st value must have the average of the preceding 3000 values subtracted from it, and then the difference must be divided by that very same average of the preceding 3000 values). My data is collected at a rate of 100 Hz, so to get a normalization over the last 30 s I must use the preceding 3000 values.
As it stands, this is how I've managed to make it work:
This stores the signal in the variable photosignal:
photosignal = np.array(seg.analogsignals[0], ndmin=1)
Now, this is the part I use to get the delta F/F over a moving window of 30 s:
normalizedphotosignal = [(uu-(np.mean(photosignal[uu-3000:uu])))/abs(np.mean(photosignal[uu-3000:uu])) for uu in photosignal[3000:]]
The following adds 3000 values at the beginning to keep the array the same length, since later on I must time-lock it to another list of the same length:
holder =list(range(3000))
normalizedphotosignal = holder + normalizedphotosignal
What I have noticed is that for certain files this code gives me an error saying that the "slice" is empty and therefore it cannot compute a mean.
I think maybe there is a better way to program this that could avoid the problem altogether. Or is this a correct way to approach the problem?
So I tried the solution, but it is quite slow and it nevertheless still gives me the "empty slice" error.
I went over the moving average post and found this method:
def running_mean(x, N):
    cumsum = np.cumsum(np.insert(x, 0, 0))
    return (cumsum[N:] - cumsum[:-N]) / N
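To see what the function above returns, a quick check with made-up numbers:

import numpy as np

x = np.arange(1.0, 7.0)    # [1. 2. 3. 4. 5. 6.]
print(running_mean(x, 3))  # [2. 3. 4. 5.]: the mean of each 3-value window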
However, I'm having trouble adapting it to my desired output, namely (x - running average) / running average.
Alright, so I finally figured it out thanks to your help and the posts you referred me to.
The calculation for my entire data set (300,000+ points) takes about a second!
I used the following code:
def runningmean(x, N):
    cumsum = np.cumsum(np.insert(x, 0, 0))
    return (cumsum[N:] - cumsum[:-N]) / N

photosignal = np.array(seg.analogsignals[0], ndmin=1)
photosignalaverage = runningmean(photosignal, 3000)
holder = np.zeros(2999)
photosignalaverage = np.append(holder, photosignalaverage)
deltafsignal = (photosignal - photosignalaverage) / abs(photosignalaverage)
photosignal stores my raw signal in a numpy array.
photosignalaverage uses cumsum to calculate the running average at every data point of photosignal. I then prepend 2999 zeros to maintain the same length as photosignal.
I then use basic numpy operations to get my delta F/F signal.
Thank you once more for the feedback, it was truly helpful!
Your approach goes in the right direction. However, you made a mistake in your list comprehension: you are using uu as your index, whereas the uu are the elements of your input data photosignal.
You want something like this:
normalizedphotosignal2 = np.zeros((photosignal.shape[0] - 3000))
for i, uu in enumerate(photosignal[3000:]):
    window_mean = np.mean(photosignal[i:i + 3000])  # the 3000 values preceding uu
    normalizedphotosignal2[i] = (uu - window_mean) / abs(window_mean)
Keep in mind that for-loops are relatively slow in python. If performance is an issue here, you could try avoiding the for loop and use numpy methods instead (e.g. have a look at Moving average or running mean).
Hope this helps.
Given that we have two lines on a graph (I just noticed that I inverted the numbers on the Y axis; this was a mistake, it should go from 11 to 1),
and we only care about whole-number X axis intersections,
we need to order these points from highest Y value to lowest Y value, regardless of their position on the X axis (note: I did these pictures by hand so they may not line up perfectly).
I have a couple of questions:
1) I have to assume this is a known problem, but does it have a particular name?
2) Is there a known optimal solution when dealing with tens of billions (or hundreds of millions) of lines? Our current process of manually calculating each point and then comparing it to a giant list requires hours of processing. Even though we may have a hundred million lines, we typically only want the top 100 or 50,000 results; some lines are so far "below" other lines that calculating their points is unnecessary.
Your data structure is a set of tuples:
lines = {(y0, Δy0), (y1, Δy1), ...}
You need only the ntop points, hence build a set containing only the top ntop yi values, with a single pass over the data:
top_points = choose(lines, ntop)
EDIT: to choose the ntop points we have to keep track of the smallest one, and this is interesting info, so let's also return that value from choose; we also need to initialize decremented:
top_points, smallest = choose(lines, ntop)
decremented = top_points
and start a loop...
while True:
    # Generate a new set of decremented values
    decremented = {(y - Δy, Δy) for y, Δy in decremented if y > smallest}
    if not decremented: break
    # Generate a set of candidates
    candidates = top_points.union(decremented)
    # Generate a new set of top points
    new_top_points, smallest = choose(candidates, ntop)
    top_points = new_top_points
(The check if new_top_points == top_points: break from the first version is no longer necessary: the loop now terminates when decremented becomes empty.)
The difficult part is the choose function, but I think that this answer to the question "How can I sort 1 million numbers, and only print the top 10 in Python?" could help you.
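That linked answer boils down to heapq. A minimal sketch of choose along those lines (my own implementation, assuming the (y, Δy) tuples introduced above and that lines holds at least ntop elements):

import heapq

def choose(lines, ntop):
    # One pass over the data: keep the ntop tuples with the largest y,
    # and report the smallest chosen y so callers can prune against it.
    top = heapq.nlargest(ntop, lines, key=lambda point: point[0])
    return set(top), top[-1][0]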
It's not a really complicated thing, just a "normal" sorting problem.
Usually sorting requires a large amount of computing time. But your case is one where you don't need complex sorting techniques.
Both of your graphs are growing or falling constantly; there are no "jumps". You can use this to your advantage. The basic algorithm (a code sketch follows the list):
identify whether a graph is growing or falling.
write a generator that yields the values: from left to right if rising, from right to left if falling.
get the first value from both graphs.
insert the lower one into the result list.
get a new value from the graph that had the lower value.
repeat the last two steps until one generator is "empty".
append the leftover items from the other generator.
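A minimal sketch of that merge, using heapq.merge (which performs exactly steps 3 to 7 above). The two generators are hypothetical stand-ins for real line data:

import heapq

# Hypothetical lines y = 2x + 1 (rising) and y = 30 - 3x (falling),
# sampled at whole-number x from 0 to 10.
rising = (2*x + 1 for x in range(0, 11))         # read left to right: 1, 3, ..., 21
falling = (30 - 3*x for x in range(10, -1, -1))  # read right to left: 0, 3, ..., 30

# heapq.merge repeatedly takes the lower head of the two streams,
# so the combined output is ascending; reverse it for highest-first.
merged = list(heapq.merge(rising, falling))
print(merged[::-1][:5])  # top 5 values: [30, 27, 24, 21, 21]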
I have written code in Python to create a transition probability matrix from data, but I keep getting wrong values for two specific data points. I have spent several days trying to figure out the problem, but with no success.
About the code: the input is 4 columns in a csv file. After preparation of the data, the first two columns are the new and old state values. I need to calculate how often each old state value transfers to a new one (basically, how often each pair (x,y) occurs in the first two columns of the data). The values in these columns are from 0 to 99. In the trans_pr matrix I want the number of times a pair (x,y) occurs in the data, stored at the corresponding coordinates (x,y) of the matrix. Since the values are from 0 to 99, I can just add 1 to the matrix at these coordinates each time they occur in the data.
The problem: the code works fine, but I always get zeros at coordinates (:,29), (:,58), (29,:) and (58,:), despite having observations there. It also sometimes seems to add the count at these coordinates to the previous row. That doesn't make any sense to me either.
I would be very grateful if anyone could help. (I am new to Python, therefore the code is probably inefficient, but only the bug is relevant.)
The code is as simple as it can be:
from numpy import *
import csv
my_data = genfromtxt('99c_test.csv', delimiter=',')
"""prepares data for further calculations"""
my_data1=zeros((len(my_data),4))
my_data1[1:,0]=100*my_data[1:,0]
my_data1[1:,1]=100*my_data[1:,3]
my_data1[1:,2]=my_data[1:,1]
my_data1[1:,3]=my_data[1:,2]
my_data2=my_data1
trans_pr=zeros((101,101))
print my_data2
"""fills the matrix with frequencies of observations"""
for i in range(len(my_data2)):
    trans_pr[my_data2[i,1],my_data2[i,0]]=trans_pr[my_data2[i,1],my_data2[i,0]]+1
c = csv.writer(open("trpr1.csv", "wb"))
c.writerows(trans_pr)
You can test the code with this input (just save it as csv file):
p_cent,p_euro,p_euro_old,p_cent_old
0.01,1,1,0.28
0.01,1,1,0.29
0.01,1,1,0.3
0.01,1,1,0.28
0.01,1,1,0.29
0.01,1,1,0.3
0.01,1,1,0.57
0.01,1,1,0.58
0.01,1,1,0.59
0.01,1,1,0.6
This sounds very much like a rounding issue. I'd suppose that e.g. 100*0.29 (as a floating point number) is rounded downwards (i.e. truncated) when used as an array index, and thus yields 28 instead of 29. Try rounding the numbers yourself (i.e. proper up/down rounding) before using them as an array index.
Update: verified my conjecture by testing it; the numbers behave exactly as described above - see here.
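A quick way to see this for yourself (CPython with IEEE-754 doubles; the exact printed digits may vary by platform):

print(100 * 0.29)              # 28.999999999999996 rather than 29.0
print(int(100 * 0.29))         # 28: indexing truncates toward zero
print(int(round(100 * 0.29)))  # 29: rounding first gives the intended index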
You may find rint() useful, from numpy. It rounds a value to the nearest integer (see the numpy.rint() doc). Have you tried the following?
for i in range(len(my_data2)):
    row = int(rint(my_data2[i,1]))  # rint returns a float, so cast before indexing
    col = int(rint(my_data2[i,0]))
    trans_pr[row, col] = trans_pr[row, col] + 1