Python bug when creating matrices


I have written Python code to create a transition probability matrix from data, but I keep getting wrong values for two specific data points. I have spent several days trying to figure out the problem, with no success.
About the code: The input is a CSV file with 4 columns. After preparing the data, the first two columns hold the new and old state values. I need to calculate how often each old state value transitions to a new one (basically, how often each pair (x,y) occurs in the first two columns of the data). The values in these columns range from 0 to 99. In the trans_pr matrix I want the count of each pair (x,y) stored at the corresponding coordinates (x,y). Since the values range from 0 to 99, I can just add 1 to the matrix at these coordinates each time the pair occurs in the data.
The problem: The code runs, but I always get zeros at coordinates (:,29), (:,58), (29,:) and (58,:) despite having observations there. It also sometimes seems to add the count for these coordinates to the previous row. Again, this doesn't make any sense to me.
I would be very grateful if anyone could help. (I am new to Python, so the code is probably inefficient, but only the bug is relevant.)
The code is as simple as it can be:
from numpy import *
import csv
my_data = genfromtxt('99c_test.csv', delimiter=',')
"""prepares data for further calculations"""
my_data1=zeros((len(my_data),4))
my_data1[1:,0]=100*my_data[1:,0]
my_data1[1:,1]=100*my_data[1:,3]
my_data1[1:,2]=my_data[1:,1]
my_data1[1:,3]=my_data[1:,2]
my_data2=my_data1
trans_pr=zeros((101,101))
print my_data2
"""fills the matrix with frequencies of observations"""
for i in range(len(my_data2)):
    trans_pr[my_data2[i,1],my_data2[i,0]]=trans_pr[my_data2[i,1],my_data2[i,0]]+1
c = csv.writer(open("trpr1.csv", "wb"))
c.writerows(trans_pr)
You can test the code with this input (just save it as csv file):
p_cent,p_euro,p_euro_old,p_cent_old
0.01,1,1,0.28
0.01,1,1,0.29
0.01,1,1,0.3
0.01,1,1,0.28
0.01,1,1,0.29
0.01,1,1,0.3
0.01,1,1,0.57
0.01,1,1,0.58
0.01,1,1,0.59
0.01,1,1,0.6

This sounds very much like a rounding issue. I suppose that e.g. 100*0.29 (as a floating point number) comes out slightly below 29 and is then rounded downwards (i.e. truncated) when used as an index, which yields 28 instead of 29. Try rounding the numbers yourself (to the nearest integer) before using them as array indices.
Update: I verified my conjecture by testing it; even the numbers are as described above.
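The truncation described above can be reproduced in plain Python. This is only a minimal sketch using two values from the sample input; the variable names are made up for illustration and do not come from the asker's script.

```python
# 0.29 and 0.58 have no exact binary representation, so the scaled
# values land just below the integers one would expect.
scaled_29 = 100 * 0.29
scaled_58 = 100 * 0.58

# int() truncates toward zero, which mirrors what happens when a float
# is silently converted to an array index.
truncated = (int(scaled_29), int(scaled_58))

# Rounding to the nearest integer first recovers the intended indices.
rounded = (int(round(scaled_29)), int(round(scaled_58)))

print(scaled_29, scaled_58)
print(truncated, rounded)
```

This would explain why the counts meant for 29 and 58 end up one position earlier.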

You may find rint() from numpy useful. It rounds a value to the nearest integer (see the numpy.rint() doc). Note that rint() returns a float, so the result is cast to int before it is used as an index. Have you tried the following:
for i in range(len(my_data2)):
    row = int(rint(my_data2[i,1]))
    col = int(rint(my_data2[i,0]))
    trans_pr[row, col] = trans_pr[row, col] + 1
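If the loop itself becomes a bottleneck, the same counting can probably be done without it. The following is a sketch, not the original poster's code: the two small arrays stand in for the scaled columns of the CSV, and np.add.at is used because it accumulates correctly when the same (row, column) pair occurs more than once.

```python
import numpy as np

# Stand-ins for the p_cent_old and p_cent columns of the sample input.
p_cent_old = np.array([0.28, 0.29, 0.3, 0.57, 0.58])
p_cent = np.array([0.01, 0.01, 0.01, 0.01, 0.01])

# Scale, round to the nearest integer, then convert to an integer dtype.
rows = np.rint(100 * p_cent_old).astype(int)
cols = np.rint(100 * p_cent).astype(int)

trans_pr = np.zeros((101, 101), dtype=int)
# Unbuffered in-place add: repeated index pairs are each counted.
np.add.at(trans_pr, (rows, cols), 1)

print(trans_pr[29, 1], trans_pr[58, 1])
```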

Related

Appending to a numpy array in for loop

I'm trying to create a Monte Carlo simulation to simulate future stock prices using Numpy arrays.
My current approach is: create a For Loop which fills an array, stock_price_array, with simulated stock prices. These stock prices are generated by taking the last stock price, then multiplying it by 1 + an annual return. The annual returns are drawn randomly from a normal distribution and stored in the array annual_ret.
My problem is that although the "stock price" variables I print from my For Loop appear to be correct, I simply cannot figure out how to Append these stock price variables to stock_price_array.
I've tried various methods, including initializing the stock_price_array using .full instead of .empty, changing the order of where the array appears in the For Loop, and checking the size of the array.
I've read other Stack Overflow posts on similar topics but can't figure out what I'm doing wrong.
Thank you in advance for your help!
annual_mean = .06
annual_stdev = .15
start_stock_price = 100
numYears = 3
numSimulations = 4
stock_price_array = np.empty(numYears)
# draw an annual return from a normal distribution; this annual return will be random
annual_ret = np.random.normal(annual_mean, annual_stdev, numSimulations)
for i in range(numYears):
    stock_price = np.multiply(start_stock_price, (1 + annual_ret[i]))
    np.append(stock_price_array, [stock_price])
    start_stock_price = stock_price

The first rule of numpy is: never iterate over your array yourself. Use numpy functions that do the whole computation in one batch. (They do iterate over the array internally, of course, but that iteration is not a Python loop, so it is much faster.)
No-for solution
For example, here you could do something like this:
np.cumprod(np.hstack([start_stock_price, annual_ret+1]))
This first builds an array made of the initial value followed by the growth factors.
So if the initial value is 100 and the interest rates are 0.1, -0.1, 0.2, 0.2 (for example), then hstack builds an array of the values 100, 1.1, 0.9, 1.2, 1.2.
cumprod then computes the cumulative product of those:
100, 100×1.1=110, 100×1.1×0.9=110×0.9=99, 100×1.1×0.9×1.2=99×1.2=118.8, 100×1.1×0.9×1.2×1.2=118.8×1.2=142.56
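The worked numbers above can be checked with a few lines. This is just the example from the text written out; the names start_price and rates are illustrative.

```python
import numpy as np

start_price = 100.0
rates = np.array([0.1, -0.1, 0.2, 0.2])  # the example rates from the text

# Initial value followed by the growth factors: 100, 1.1, 0.9, 1.2, 1.2
factors = np.hstack([start_price, rates + 1])

# Running product: each entry is the price after one more year.
prices = np.cumprod(factors)
print(prices)
```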
Correction of yours
To answer your initial question anyway (even though I strongly advise using a solution like the cumprod one shown above), you have 2 choices:
Either you allocate the array in advance, as you did (your stock_price_array = np.empty(numYears)). Then, instead of trying to append the new stock_price to stock_price_array, you simply fill one of the empty slots that are already there, with stock_price_array[i] = stock_price
Or you don't. Then you replace the np.empty line with stock_price_array=[], and at each step you append the result to create a new stock_price_array, like this: stock_price_array = np.append(stock_price_array, [stock_price])
I strongly advise against the second solution. Since you already know the final size of the array, it is much better to create it once, because np.append creates a brand new array and then copies the input data into it. It does not just extend the existing array (generally speaking, that is not possible anyway).
But, well, I advise against both solutions anyway, since I find mine (with cumprod) preferable. for is the taboo word in numpy, even more so when what is inside the for creates a new array, as append does.
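For completeness, here is a sketch of the first choice (the preallocated array filled by index), next to the loop-free version for comparison. The seed and the use of np.random.default_rng are assumptions made only so the sketch is repeatable; the question itself uses np.random.normal.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable
annual_mean, annual_stdev = 0.06, 0.15
start_stock_price = 100.0
numYears = 3

annual_ret = rng.normal(annual_mean, annual_stdev, numYears)

# Choice 1: allocate once, then fill slot i instead of appending.
stock_price_array = np.empty(numYears)
price = start_stock_price
for i in range(numYears):
    price = price * (1 + annual_ret[i])
    stock_price_array[i] = price

# Loop-free equivalent, without the initial price as first element.
vectorized = start_stock_price * np.cumprod(1 + annual_ret)

print(stock_price_array)
print(vectorized)
```

Both arrays should agree up to floating point rounding, which makes the loop version a handy check while moving to cumprod.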
Monte-Carlo
Since you've mentioned Monte-Carlo, and then shown code that computes only one result (you draw one set of annual_ret values and perform one computation of future values), I am wondering whether that is really what you want.
In particular, I see that you have numSimulations and numYears, which appear to play redundant roles in your code (and therefore in mine).
The only reason it doesn't throw an index error is that numSimulations is used only to decide how many annual_ret values you draw. Since numSimulations > numYears, you have more than enough annual_ret values to compute the result.
Wasn't your initial intention to repeat the simulation over the years numSimulations times, to get numSimulations results?
In that case, you probably need numSimulations sets of numYears annual rates, so a 2D array. Likewise, you should be computing numSimulations series of numYears results.
If my guess is not completely off, I surmise that what you really wanted was something along the lines of:
annual_ret = np.random.normal(annual_mean, annual_stdev, (numSimulations, numYears)) # 2d array of interest rate. 1 simulation per row, 1 year per column
t = np.pad(annual_ret+1, ((0,0), (1,0)), constant_values=start_stock_price) # Add 1 as we did earlier. And pad with an initial 100 (`start_stock_price`) at the beginning of each simulation
res = np.cumprod(t, axis=1) # cumulative multiplication. `axis=1` means that it is done along axis 1 (along years) for each row (for each simulation)

Sort unknown length array within unknown length 2D array - Python

I have a Python script which ends up creating a 2D array based on user input. Therefore, the length of the 2D array is unknown and the length of the individual arrays within the 2D array are also unknown until the user has input the information. I would like to sort the individual array pieces based on a value associated with them. An example of a possible output that needs to be sorted is below:
Basically, each individual array is a failure symptom followed by a list of possible components, each with an associated "score" that is the likelihood that this component is causing the failure. My goal is to reorder the array so that the components and their scores are in descending order of score, i.e., each component and its score need to be moved together. The problem, as I said, is that I do not know the length of anything until user input is given. There could be only 1 failure symptom input, or there could be 9. A failure symptom could contain only 1 component, or maybe 12. I know it will take nested for loops and if statements, but I haven't been able to figure it out for all the possible scenarios. Some possible scenarios I have thought of:
The array is already in order (move to the next failure symptom)
The first component is correct, but the ones after may not be. Or the first two are correct, but the ones after may not be, etc...
The array is completely backwards in order
The array only contains 1 component, therefore there is no need to sort
The array is in some random order, so some positions for some components may already be in the correct spot while some others aren't
Every time I feel like I am making headway, I think of another scenario which wouldn't hold up. Any help is greatly appreciated!

Your problem is a bit special. You don't only want to sort a multidimensional array, which would be rather simple using the default sorting algorithms; you also want to keep each key/value pair together.
The second problem is that the keys are strings with numbers in them. A simple string comparison wouldn't work, because strings are compared character by character, so "test9" > "test11" would be true (the second 1 wouldn't even be looked at, because 9>1).
The simplest solution I figured out would be the following:
#get the failure id of one list
def failureId(value):
    return int(value[0].replace("failure",""))
#get the id of one component
def componentId(value):
    return int(value.replace("component",""))
#sort one failure list using bubble sort
def sortFailure(failure):
    #one pass per component (only the keys are counted, the values are skipped)
    for i in range(1,len(failure), 2):
        #each pass compares every pair of neighbouring components
        for j in range(1,len(failure)-2, 2):
            #comparing the component ids
            if (componentId(failure[j])>componentId(failure[j+2])):
                #swapping keys and values
                failure[j],failure[j+2] = failure[j+2],failure[j]
                failure[j+1],failure[j+3] = failure[j+3],failure[j+1]
#sorting the full list
def sortData(data):
    #sorting the failures using default sort algorithm
    data.sort(key=failureId)
    #sorting the single list of failure data itself
    for failure in data:
        sortFailure(failure)
data = [['failure2', 'component2', 0.15, 'component1', 0.85], ['failure3', 'component1', 0.95], ['failure1','component1',0.05,'component3', 0.8, 'component2', 0.1, 'component4', 0.05]]
print(data)
sortData(data)
print(data)
The first two functions extract the numbers (the ids) from the strings, as mentioned above. The third function uses "bubble sort" to sort one failure list. It uses a step of 2 in the range function because we want to skip the value that follows each component. If two neighbouring components are in the wrong order, we swap both the keys and their values. In the sortData function we use the built-in sort method for lists to sort the whole list (by failure id). Then we take each "sublist" and sort it using the other function.
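The code above orders components by their id. The question asks for descending order by score, so here is a hedged sketch of that variant. It assumes the flat layout shown in the sample data (symptom first, then alternating component name and score); the function name sort_by_score is made up for this example.

```python
def sort_by_score(failure):
    """Return a new list with (component, score) pairs in descending score order."""
    symptom, rest = failure[0], failure[1:]
    # Pair each component name with the score that follows it.
    pairs = list(zip(rest[0::2], rest[1::2]))
    # list.sort is stable, so components with equal scores keep their order.
    pairs.sort(key=lambda pair: pair[1], reverse=True)
    result = [symptom]
    for name, score in pairs:
        result.extend([name, score])
    return result

data = [['failure2', 'component2', 0.15, 'component1', 0.85],
        ['failure3', 'component1', 0.95],
        ['failure1', 'component1', 0.05, 'component3', 0.8,
         'component2', 0.1, 'component4', 0.05]]

sorted_data = [sort_by_score(failure) for failure in data]
for failure in sorted_data:
    print(failure)
```

Because the pairs are built with slicing, the number of symptoms and the number of components per symptom do not need to be known in advance.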

How to generate unique(!) arrays/lists/sequences of uniformly distributed random

Let's say I generate a pack, i.e., a one-dimensional array of 10 random numbers, with a random generator. Then I generate another array of 10 random numbers. I do this X times. How can I generate unique arrays, so that even after a trillion generations there is no array which is equal to another?
In one array, the elements can be duplicates. The array just has to differ from the other arrays with at least one different element from all its elements.
Is there any numpy method for this? Is there some special algorithm which works differently by exploring some space for the random generation? I don’t know.
One easy answer would be to write the arrays to a file and check if they were generated already, but the I/O operations on a subsequently bigger file needs way too much time.

This is a difficult request, since one of the properties of an RNG is that it should repeat sequences randomly.
You also have the problem of trying to record terabytes of prior results. One thing you could try is to form a hash table (for search speed) of the existing arrays. Using this depends heavily on whether you have sufficient RAM to hold the entire list.
If not, you might consider disk-mapping a fast search structure of some sort. For instance, you could implement an on-disk binary tree of hash keys, re-balancing whenever you double the size of the tree (with insertions). This lets you keep the file open and find entries via seek, rather than needing to represent the full file in memory.
You could also maintain an in-memory index to the table, using that to drive your seek to the proper file section, then reading only a small subset of the file for the final search.
Does that help focus your implementation?

Assume that the 10 numbers in a pack are each in the range [0..max]. Each pack can then be considered as a 10 digit number in base max+1. Obviously, the size of max determines how many unique packs there are. For example, if max=9 there are 10,000,000,000 possible unique packs from [0000000000] to [9999999999].
The problem then comes down to generating unique numbers in the correct range.
Given your "trillions" then the best way to generate guaranteed unique numbers in the range is probably to use an encryption with the correct size output. Unless you want 64 bit (DES) or 128 bit (AES) output then you will need some sort of format preserving encryption to get output in the range you want.
For input, just encrypt the numbers 0, 1, 2, ... in turn. Encryption guarantees that, given the same key, the output is unique for each unique input. You just need to keep track of how far you have got with the input numbers. Given that, you can generate more unique packs as needed, within the limit imposed by max. After that point the output will start repeating.
Obviously as a final step you need to convert the encryption output to a 10 digit base max+1 number and put it into an array.
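The structure of this approach (count upwards, map each counter through a one-to-one function, convert the result to digits) can be sketched as follows. Loud assumption: the mapping here is a plain affine permutation, (a*x + b) mod N with a coprime to N, used only as a stand-in because it is short and provably one-to-one. It is not encryption and its output is far less random-looking than a real format-preserving cipher would give. All names and constants are illustrative.

```python
from math import gcd

def make_pack_generator(pack_length, max_value, multiplier=2654435761, offset=12345):
    """Return (pack_for, total); pack_for(counter) is unique for each counter in [0, total)."""
    base = max_value + 1
    total = base ** pack_length  # number of distinct packs that exist
    if gcd(multiplier, total) != 1:
        raise ValueError("multiplier must be coprime to the number of packs")

    def pack_for(counter):
        if not 0 <= counter < total:
            raise ValueError("all unique packs have been used")
        # Stand-in for the encryption step: a bijection on [0, total).
        value = (multiplier * counter + offset) % total
        # Convert to pack_length digits in base max_value + 1
        # (least significant digit first; the order does not affect uniqueness).
        digits = []
        for _ in range(pack_length):
            value, digit = divmod(value, base)
            digits.append(digit)
        return digits

    return pack_for, total

pack_for, total = make_pack_generator(pack_length=3, max_value=3)
packs = [tuple(pack_for(i)) for i in range(total)]
print(total, len(set(packs)))
```

Only the counter has to be remembered between runs, which is the main attraction compared with storing every pack already generated.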

Important caveat:
This will not allow you to generate "arbitrarily" many unique packs. Please see the limits highlighted by @Prune.
Note that as the number of requested packs approaches the number of unique packs this takes longer and longer to find a pack. I also put in a safety so that after a certain number of tries it just gives up.
Feel free to adjust:
import random
## -----------------------
## Build a unique pack generator
## -----------------------
def build_pack_generator(pack_length, min_value, max_value, max_attempts):
    existing_packs = set()
    def _generator():
        pack = tuple(random.randint(min_value, max_value) for _ in range(1, pack_length +1))
        pack_hash = hash(pack)
        attempts = 1
        while pack_hash in existing_packs:
            if attempts >= max_attempts:
                raise KeyError("Unable to find a valid pack")
            pack = tuple(random.randint(min_value, max_value) for _ in range(1, pack_length +1))
            pack_hash = hash(pack)
            attempts += 1
        existing_packs.add(pack_hash)
        return list(pack)
    return _generator
generate_unique_pack = build_pack_generator(2, 1, 9, 1000)
## -----------------------
for _ in range(50):
    print(generate_unique_pack())

The Birthday problem suggests that at some point you don't need to bother checking for duplicates. For example, if each value in a 10 element "pack" can take on more than ~250 values then you only have a 50% chance of seeing a duplicate after generating 1e12 packs. The more distinct values each element can take on the lower this probability.
You've not specified what these random values are in this question (other than being uniformly distributed) but your linked question suggests they are Python floats. Hence each number has 2**53 distinct values it can take on, and the resulting probability of seeing a duplicate is practically zero.
There are a few ways of rearranging this calculation:
for a given amount of state and number of iterations what's the probability of seeing at least one collision
for a given amount of state how many iterations can you generate to stay below a given probability of seeing at least one collision
for a given number of iterations and probability of seeing a collision, what state size is required
The below Python code calculates option 3 as it seems closest to your question. The other options are available on the birthday attack page.
from math import log2, log1p
def birthday_state_size(size, p):
    # -log1p(-p) is a numerically stable version of log(1/(1-p))
    return size**2 / (2*-log1p(-p))
log2(birthday_state_size(1e12, 1e-6)) # => ~100
So as long as you have more than 100 uniform bits of state in each pack everything should be fine. For example, two or more Python floats is OK (2 * 53), as is 10 integers with >= 1000 distinct values (10*log2(1000)).
You can of course reduce the probability down even further, but as noted in the Wikipedia article going below 1e-15 quickly approaches the reliability of a computer. This is why I say "practically zero" given the 530 bits of state provided by 10 uniformly distributed floats.
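The first rearrangement in the list above (probability of at least one collision for a given state size and number of iterations) can be sketched with the usual birthday approximation p ≈ 1 - exp(-n²/(2N)). The function name is made up for this example, and the approximation is only meant for n much smaller than N.

```python
from math import expm1

def collision_probability(iterations, state_size):
    """Approximate chance of at least one duplicate among `iterations` draws."""
    # -expm1(-x) is a numerically stable way to compute 1 - exp(-x).
    return -expm1(-iterations**2 / (2 * state_size))

# 10 elements with 250 distinct values each, 1e12 packs generated.
p_250 = collision_probability(1e12, 250**10)

# 10 uniformly distributed floats, i.e. roughly 530 bits of state.
p_float = collision_probability(1e12, 2.0**530)

print(p_250, p_float)
```

With these inputs the first probability is of the same order as the "50% chance" quoted earlier, and the second is negligible.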

How to calculate Delta F / F using python?

I've recently "taught" myself Python in order to analyze data for my experiments. As such I'm pretty clueless about many aspects. I've managed to make my analysis work for certain files, but in some cases it breaks down, and I imagine it is a result of faulty programming.
Currently I export a file containing 3 numpy arrays. One of these arrays is my signal (float values from -10 to 10). What I wish to do is normalize every datum in this array to a range of values that precede it (i.e. the 30001st value must have the average of the preceding 3000 values subtracted from it, and the difference must then be divided by this very same average of the preceding 3000 values). My data is collected at a rate of 100 Hz, so to normalize to the last 30 s I must use the preceding 3000 values.
As it stands, this is how I've managed to make it work:
This stores the signal in the variable photosignal:
photosignal = np.array(seg.analogsignals[0], ndmin=1)
Now this is the part I use to get the delta F/F over a moving window of 30 s:
normalizedphotosignal = [(uu-(np.mean(photosignal[uu-3000:uu])))/abs(np.mean(photosignal[uu-3000:uu])) for uu in photosignal[3000:]]
The following adds 3000 values to the beginning to keep the array the same length, since later on I must time-lock it to another list of the same length:
holder =list(range(3000))
normalizedphotosignal = holder + normalizedphotosignal
What I have noticed is that in certain files this code gives me an error, because it says that the "slice" is empty and therefore it cannot compute a mean.
I think maybe there is a better way to program this that could avoid this problem altogether. Or is this a correct way to approach the problem?
So I tried the solution, but it is quite slow and it still gives me the "empty slice" error.
I went over the moving average post and found this method:
def running_mean(x, N):
    cumsum = np.cumsum(np.insert(x, 0, 0))
    return (cumsum[N:] - cumsum[:-N]) / N
However, I'm having trouble adapting it to my desired output, namely (x - running average) / running average.

All right, so I finally figured it out thanks to your help and the posts you referred me to.
The calculation for my entire data (300 000 +) takes about a second!
I used the following code:
def runningmean(x,N):
    cumsum =np.cumsum(np.insert(x,0,0))
    return (cumsum[N:] -cumsum[:-N])/N
photosignal = np.array(seg.analogsignals[0], ndmin =1)
photosignalaverage = runningmean(photosignal, 3000)
holder = np.zeros(2999)
photosignalaverage = np.append(holder,photosignalaverage)
detalfsignal = (photosignal-photosignalaverage)/abs(photosignalaverage)
photosignal stores my raw signal in a numpy array.
photosignalaverage uses cumsum to calculate the running average for every data point in photosignal. I then prepend 2999 zeros to keep it the same size as photosignal.
I then use basic numpy calculations to get my delta F/F signal.
Thank you once more for the feedback, it was truly helpful!
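One detail worth double-checking in the code above: as far as I can tell, the window it averages ends at the current sample rather than just before it, and the first 2999 baseline values are zeros, so the division there produces inf or nan. The sketch below is a variant under the assumption that the baseline should be the mean of the strictly preceding samples, with NaN where no full window exists yet. The function name delta_f_over_f and the small test signal are made up for illustration.

```python
import numpy as np

def delta_f_over_f(signal, window):
    """(x[k] - mean(x[k-window:k])) / |mean(x[k-window:k])|, NaN for k < window."""
    signal = np.asarray(signal, dtype=float)
    # csum[j] is the sum of the first j samples.
    csum = np.cumsum(np.insert(signal, 0, 0.0))
    # baseline[m] is the mean of signal[m:m+window], i.e. the samples
    # that precede signal[m + window].
    baseline = (csum[window:-1] - csum[:-window - 1]) / window
    out = np.full(signal.shape, np.nan)
    out[window:] = (signal[window:] - baseline) / np.abs(baseline)
    return out

example = np.arange(1.0, 11.0)        # 1, 2, ..., 10
dff = delta_f_over_f(example, window=3)
print(dff)
```

For the real data the call would presumably be delta_f_over_f(photosignal, 3000); the output has the same length as the input, so it can still be time-locked to the other arrays.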

Your approach goes in the right direction. However, you made a mistake in your list comprehension: you are using uu as the index, whereas uu is an element of your input data photosignal, not a position in it.
You want something like this:
normalizedphotosignal2 = np.zeros(photosignal.shape[0] - 3000)
for i, uu in enumerate(photosignal[3000:]):
    # uu is photosignal[i+3000], so photosignal[i:i+3000] holds the 3000 values that precede it
    baseline = np.mean(photosignal[i:i+3000])
    normalizedphotosignal2[i] = (uu - baseline) / abs(baseline)
Keep in mind that for loops are relatively slow in Python. If performance is an issue here, you could try avoiding the for loop and using numpy methods instead (e.g. have a look at Moving average or running mean).
Hope this helps.

how do I detect zero-vectors that make k-means cosine crash Matlab?

I'm running kmeans on a large dataset and I'm always getting the error below:
Error using kmeans (line 145)
Some points have small relative magnitudes, making them effectively zero.
Either remove those points, or choose a distance other than 'cosine'.
Error in runkmeans (line 7)
[L, C]=kmeans(data, 10, 'Distance', 'cosine', 'EmptyAction', 'drop')
My problem is that even when I add a 1 to all the vectors, I still get this error. I would expect it to pass then, but apparently there are still too many zeros (that is what is causing it, right?).
My question is this: what is the condition that makes Matlab decide that a point has "a small relative magnitude" and "is effectively zero"?
I want to remove all these points from my dataset using python, before I hand over the data to Matlab, because I need to compare my results with a gold standard that I process in python.
Thanks in advance!
EDIT-ANSWER
The correct answer was given below, but in case someone finds this question through Google, here's how you remove the "effectively zero-vectors" from your matrix in python. Every row (!) is a data point, so you want to transpose in python or Matlab if you're running kmeans:
def getxnorm(data):
    return np.sqrt(np.sum(data ** 2, axis=1))
def remove_zero_vector(data, startxnorm, excluded=None):
    if excluded is None:
        excluded = [] # fresh list per top-level call; a mutable default would be shared between calls
    eps = 2.2204e-016
    xnorm = getxnorm(data)
    if np.min(xnorm) <= (eps * np.max(xnorm)):
        local_index=np.transpose(np.where(xnorm == np.min(xnorm)))[0][0]
        global_index=np.transpose(np.where(startxnorm == np.min(xnorm)))[0][0]
        data=np.delete(data, local_index, 0) # data with zero vector removed
        excluded.append(global_index) # add global index to list of excluded vectors
        return remove_zero_vector(data, startxnorm, excluded)
    else:
        return (data, excluded)
I'm sure there's a much more scipythonic way for doing this, but it'll do :-)
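A loop-free variant might look like the sketch below. It applies the same relative test (a row counts as "effectively zero" if its norm is at most eps times the largest norm) to all rows at once with a boolean mask. The function name and sample data are made up for illustration, and an empty input is not handled.

```python
import numpy as np

def drop_effectively_zero_rows(data):
    """Return (rows kept, original indices of the rows removed)."""
    data = np.asarray(data, dtype=float)
    xnorm = np.sqrt(np.sum(data ** 2, axis=1))
    eps = np.finfo(float).eps  # about 2.2204e-16 for double precision
    # Keep only rows whose norm is clearly above the relative threshold.
    keep = xnorm > eps * xnorm.max()
    return data[keep], np.flatnonzero(~keep)

data = np.array([[1.0, 2.0],
                 [0.0, 0.0],
                 [3.0, 4.0],
                 [1e-20, 0.0]])
cleaned, excluded = drop_effectively_zero_rows(data)
print(cleaned)
print(excluded)
```

Since the mask is computed once against the original array, the indices in excluded already refer to the original row numbers and no separate bookkeeping is needed.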

If you're using this kmeans, then the relevant code that is throwing the error is:
    case 'cosine'
        Xnorm = sqrt(sum(X.^2, 2));
        if any(min(Xnorm) <= eps * max(Xnorm))
            error(['Some points have small relative magnitudes, making them ', ...
                   'effectively zero.\nEither remove those points, or choose a ', ...
                   'distance other than ''cosine''.'], []);
        end
So there's your test.
As you can see, what's important is relative size, so adding one to everything only makes things worse (max(Xnorm) is getting larger too). A good fix might be to scale all the data by a constant.

In your other question it looked like your data was scalar. If your input vectors only have one feature/dimension, the cosine distance between them will always be undefined (or zero), because by definition they point in the same direction (along the single axis). The cosine measure gives the angle between two vectors, which can only be non-zero if the vectors can point in different directions (i.e. dimension > 1).
