Minimizing discrete vector/array without knowing function?

I am trying to find the minimum of the second derivative for the elements in the array below:
scores = [-100.07, -40.04, -26.97, -17.31, -13.12, -9.02, -7.22,
-5.23, -4.37, -3.44, -2.92, -2.36, -2.11, -1.78,
-1.59, -1.37, -1.23, -1.1 , -0.97, -0.87]
Plotting these data when k is 20 results in the plot below. I've circled in red the "optimal" number of clusters based on this score measure.
The issue, though, is that I had to manually intervene (i.e. the computer didn't select this point of minimization). I would rather have the computer select the appropriate k, so I'm trying to find the point at which the second derivative is smallest.
I have tried differencing, e.g.
import numpy as np
first_diff = np.diff(scores, 1)
second_diff = np.diff(scores, 2)
But this isn't satisfactory, as the sequence of second differences produces some positive numbers and then some negative numbers, which won't produce the results I want when using np.argmin. Using percent changes doesn't work well, either.
Is there a reliable way to difference these types of vectors?
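One alternative to differencing, sketched here under assumptions rather than as an accepted answer, is to pick the point that lies farthest from the straight line joining the first and last scores (a common "elbow" heuristic). The helper name `elbow_index` is hypothetical, and both axes are rescaled to [0, 1] so the distance does not depend on units.

```python
import numpy as np

scores = np.array([-100.07, -40.04, -26.97, -17.31, -13.12, -9.02, -7.22,
                   -5.23, -4.37, -3.44, -2.92, -2.36, -2.11, -1.78,
                   -1.59, -1.37, -1.23, -1.1, -0.97, -0.87])

def elbow_index(values):
    """Return the index of the point farthest from the first-to-last chord."""
    values = np.asarray(values, dtype=float)
    # Rescale both axes to [0, 1] so neither dominates the distance.
    x = np.linspace(0.0, 1.0, values.size)
    y = (values - values.min()) / (values.max() - values.min())
    x0, y0, x1, y1 = x[0], y[0], x[-1], y[-1]
    # Perpendicular distance of every point from the chord.
    dist = np.abs((y1 - y0) * (x - x0) - (x1 - x0) * (y - y0))
    dist /= np.hypot(x1 - x0, y1 - y0)
    return int(np.argmax(dist))

idx = elbow_index(scores)
print(idx)  # position in the array; add the first k to get the cluster count
```

Because this uses every point's position relative to the overall curve rather than local differences, small sign changes in the second differences do not affect the result.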

Related

Is it possible to make the mean of the list of numbers generated from the normal distribution become exactly zero?

Even in the absence of errors due to finite machine precision, it would be mathematically incorrect to expect a finite number of points sampled from a Gaussian distribution to always have a mean of exactly zero. One would need an infinite number of points for this to be exactly true.
Nonetheless, I am manually (in an ad hoc manner) trying to center the distribution so that the mean is zero. To do that, I first generate a Gaussian sample, find its mean, and then shift each point by that mean. This brings the mean very close to zero, but I am left with a small value close to machine precision (of the order 10**(-17)), and I do not know how to make it exactly zero.
Here is the code I used:
import numpy as np
n=10000
X=np.random.normal(0,1,size=(2,n))[0,:]
Xm=np.mean(X)
print("Xm = ", Xm)
Y=np.random.normal(0,1,size=(2,n))[1,:]
Ym=np.mean(Y)
print("Ym = ", Ym)
for i in range(len(X)):
    X[i]=X[i]-Xm
    Y[i]=Y[i]-Ym
new_X=np.mean(X)
new_Y=np.mean(Y)
print(new_X)
print(new_Y)
Output:
Xm =  0.002713682499601005
Ym =  -0.0011499576497770079
-3.552713678800501e-18
2.2026824808563105e-17

I am not good with code, but mathematically you could have a while loop that checks whether the sum of the numbers is 0. If it isn't 0, you would add 1 to the lowest unit you allow.
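A different and simpler route than the loop suggested above, shown only as a sketch: center the sample in one vectorized step and then compare the mean with zero using a tolerance, since floating-point arithmetic does not guarantee an exactly zero result. The seed and the tolerance `1e-12` are assumptions chosen for illustration.

```python
import math
import numpy as np

rng = np.random.default_rng(0)           # fixed seed, for reproducibility only
x = rng.normal(0.0, 1.0, size=10_000)

# Subtract the mean from every element at once (no Python loop needed).
centered = x - x.mean()

# math.fsum adds the values with extended internal precision, so the
# residual reflects rounding in the data, not in the summation itself.
residual = math.fsum(centered) / centered.size
print(residual)

# Treat "zero" as "zero within a tolerance" instead of exact equality.
is_centered = math.isclose(residual, 0.0, abs_tol=1e-12)
print(is_centered)
```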

How Can I find the min and max values of y and draw a border between them in python?

I am currently gathering X-axis gyro data from a device. I need to find the difference between the extremes of the y values in graphs such as this one.
I have written an algorithm that finds the min and max y values, but when the data has minor fluctuations, as in this one, it returns faulty answers. I have written it in Python, and it is as follows:
import numpy as np
x, y = np.loadtxt("gyroDataX.txt", unpack=True, delimiter=",")
max_value = max(y)
min_value = min(y)
print(min_value)
borders = max_value - min_value
I now need to write an algorithm that will:
Determine the max and min y values and draw their borders.
Ignore minor fluctuations when it sees them.
Would writing such an algorithm be possible, and if so, how could I go about writing one? Are there any libraries or reading material you would recommend? Thanks for your help.

1. In general, math like this is best done in plain code, with little or no help from external APIs, so that the algorithm and each step are easier to debug.
2. Now for a possible answer:
Since you want to ignore outliers (those irritating minor fluctuations), you need to calculate the standard deviation of your data.
What is the standard deviation, you might ask?
It measures how far the values in your data set typically lie from the mean (symbol: µ), i.e. the average.
If you do not know what it is, here is a quick tutorial on the standard deviation and its partner, the variance.
First, the MEAN:
The mean of your x and y arrays/lists is easy to calculate. Loop through each list (a for loop works well), add up all the values, and divide the sum by the length of the list.
Second, the VARIANCE (σ squared):
As the tutorial above explains, to calculate the variance, loop through the x and y lists again, subtract the respective mean from each value to get the difference, square this difference, add all the squared differences up, and divide by the length of the respective list.
For the Standard Deviation (symbol: σ), just take the square root of the variance.
Now the standard deviation can be used to find the real max and min (leaving out those outliers).
Use this graph as an approximate reference for where most of your values are likely to lie:
Since your data is mostly uniform, you should get a pretty accurate answer.
Do test the different cutoffs, µ ± σ or µ ± 2σ, to find the best max and min.
Edit:
Sorry, only y now:
Sorry for the horrible representation and drawing. This is what it should look like graphically. Also, do experiment yourself with the distance from the mean in standard deviations (as above, µ ± σ or µ ± 2σ) to find the max and min that suit you.
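The steps above can be sketched as follows. This is one possible reading of the answer, not the answerer's own code: values farther than k standard deviations from the mean are dropped, and the min and max are taken from what remains. The function name `robust_min_max`, the default `k=2.0`, and the synthetic data are assumptions for illustration; NumPy is used here for brevity even though the answer recommends plain loops.

```python
import numpy as np

def robust_min_max(y, k=2.0):
    """Min and max of y after discarding points beyond k standard deviations."""
    y = np.asarray(y, dtype=float)
    mu = y.mean()                  # mean: sum of values / number of values
    sigma = y.std()                # sqrt of the mean squared deviation
    keep = np.abs(y - mu) <= k * sigma
    return y[keep].min(), y[keep].max()

# Synthetic stand-in for the gyro readings: a smooth ramp plus two spikes.
y = np.concatenate([np.linspace(-1.0, 1.0, 200), [50.0, -40.0]])
low, high = robust_min_max(y, k=2.0)
print(low, high)
print(high - low)                  # the "border" width between the extremes
```

Trying a few values of k, as the answer suggests, shows how aggressively the spikes are trimmed for a given data set.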

Random numbers with user-defined continuous probability distribution

I would like to simulate something on the subject of photon-photon interaction, in particular Halpern scattering (see the German Wikipedia entry, Halpern-Streuung). There, the differential cross section has an angular dependence of (3+(cos(theta))^2)^2.
I would like to have a generator of random numbers between 0 and 2*Pi that follows the density function ((3+(cos(theta))^2)^2)*(1/(99*Pi/4)). So values around 0, Pi and 2*Pi should occur a little more often than values around Pi/2 and 3*Pi/2.
I have already found a function that randomly outputs discrete values with user-defined probabilities: numpy.random.choice(numpy.arange(1, 7), p=[0.1, 0.05, 0.05, 0.2, 0.4, 0.2]). I could work with that in an emergency, should there be nothing else. But what I actually want here is a continuous probability distribution.
I know that even if there is a Python command that accepts a mathematical distribution function, it can only produce discrete values, since irrational numbers cannot be represented exactly with 1s and 0s. Still, such a command would be more elegant with a continuous function.

Assuming the density function you have is proportional to a probability density function (PDF), you can use the rejection sampling method: draw a random point in a box enclosing the density curve until the point falls under the curve. It works for any bounded density function with a closed and bounded domain, as long as you know what the domain and bound are (the bound is the maximum value of f in the domain). In this case, the bound is 64/(99*math.pi) and the algorithm works as follows:
import math
import random
def sample():
    mn=0 # Lowest value of domain
    mx=2*math.pi # Highest value of domain
    bound=64/(99*math.pi) # Upper bound of PDF value
    while True: # Do the following until a value is returned
        # Choose an X inside the desired sampling domain.
        x=random.uniform(mn,mx)
        # Choose a Y between 0 and the maximum PDF value.
        y=random.uniform(0,bound)
        # Calculate PDF
        pdf=(((3+(math.cos(x))**2)**2)*(1/(99*math.pi/4)))
        # Does (x,y) fall in the PDF?
        if y<pdf:
            # Yes, so return x
            return x
        # No, so loop
See also the section "Sampling from an Arbitrary Distribution" in my article on randomization.
The following shows the method's correctness by showing the probability that the returned sample is less than π/8. For correctness, the probability should be close to 0.0788:
print(sum(1 if sample()<math.pi/8 else 0 for _ in range(1000000))/1000000)
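When many samples are needed, the same rejection idea can be applied to whole arrays at once. The following is a sketch using NumPy; the names `pdf` and `sample_many` and the batch size of `2 * n` are assumptions, not part of the answer above.

```python
import numpy as np

def pdf(theta):
    # Normalized density on [0, 2*pi] from the question.
    return (3 + np.cos(theta) ** 2) ** 2 / (99 * np.pi / 4)

def sample_many(n, rng=None):
    """Draw n samples by rejection sampling, one batch of candidates at a time."""
    rng = np.random.default_rng() if rng is None else rng
    bound = 64 / (99 * np.pi)              # maximum of the density
    accepted = np.empty(0)
    while accepted.size < n:
        x = rng.uniform(0.0, 2 * np.pi, size=2 * n)
        y = rng.uniform(0.0, bound, size=2 * n)
        # Keep the candidates whose (x, y) point lies under the curve.
        accepted = np.concatenate([accepted, x[y < pdf(x)]])
    return accepted[:n]

samples = sample_many(200_000, np.random.default_rng(0))
print(np.mean(samples < np.pi / 8))        # should be close to 0.0788
```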

I had two approaches in mind: the inverse transform sampling method and the "deletion method" (I'll just call it that). For inverse transform sampling, my distribution does have an inverse function, but I run into problems in several places with the math functions because of their domain, e.g. math.sqrt(-1). You would still have to work around that with if-statements. That's why I decided to use Peter's suggestion.
If you collect values in a loop and plot them in a histogram, it looks quite good. Here it is with 40000 values and 100 bins:
Here is the whole code for anyone who is interested:
import math
import random
import matplotlib.pyplot as plt
N=40000
bins=100
def Deletion_method():
    x=None
    while x is None:
        mn=0 # Lowest value of domain
        mx=2*math.pi # Highest value of domain
        bound=64/(99*math.pi) # Upper bound of PDF value
        # Choose an X inside the desired sampling domain.
        xrad=random.uniform(mn,mx)
        # Choose a Y between 0 and the maximum PDF value.
        y=random.uniform(0,bound)
        # Calculate PDF
        P=((3+(math.cos(xrad))**2)**2)*(1/(99*math.pi/4))
        # Does (x,y) fall in the PDF?
        if y<P:
            x=xrad
    return x
Values=[]
for k in range(0, N):
    # A plain list append avoids copying the whole array on every iteration.
    Values.append(Deletion_method())
plt.hist(Values, bins)
plt.show()
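The inverse transform method mentioned above can also be done numerically, which sidesteps the domain problems of a symbolic inverse. This sketch assumes the closed-form CDF obtained by integrating the density, and inverts it by linear interpolation on a grid; the names `cdf` and `sample_inverse_cdf` and the grid size are hypothetical choices.

```python
import numpy as np

def cdf(theta):
    # Integral of (3 + cos(t)**2)**2 / (99*pi/4) for t from 0 to theta.
    return (99 * theta / 8 + 7 * np.sin(2 * theta) / 4
            + np.sin(4 * theta) / 32) / (99 * np.pi / 4)

def sample_inverse_cdf(n, rng=None, grid_size=10_001):
    """Draw n samples by inverting the CDF with linear interpolation."""
    rng = np.random.default_rng() if rng is None else rng
    grid = np.linspace(0.0, 2 * np.pi, grid_size)
    u = rng.uniform(0.0, 1.0, size=n)
    # The CDF is strictly increasing, so swapping the roles of x and y
    # in np.interp gives an approximate inverse.
    return np.interp(u, cdf(grid), grid)

values = sample_inverse_cdf(200_000, np.random.default_rng(1))
print(np.mean(values < np.pi / 8))   # should be close to 0.0788
```

The interpolation error shrinks as `grid_size` grows; rejection sampling, by contrast, is exact but discards part of the candidates.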

Question about numpy correlate: not giving expected result

I want to make sure I am using numpy's correlate correctly, because it is not giving me the answer I expect. Perhaps I am misunderstanding the correlate function. Here is a code snippet with comments:
import numpy as np
ref = np.sin(np.linspace(-2*np.pi, 2*np.pi, 10000)) # make some data
fragment = ref[2149:7022] # create a fragment of data from ref
corr = np.correlate(ref, fragment) # Find the correlation between the two
maxLag = np.argmax(corr) # find the maximum lag, this should be the offset that we chose above, 2149
print(maxLag)
2167 # I expected this to be 2149.
Isn't the index in the corr array where the correlation is maximum the lag between these two datasets? I would think the starting index I chose for the smaller dataset would be the offset with the greatest correlation.
Why is there a discrepancy between what I expect, 2149, and the result, 2167?
Thanks

That looks like a precision error to me. Cross-correlation is an integral, and it will always have problems when represented in discrete space; I guess the problem arises when the values are close to 0. Increasing the numbers or the precision might make the difference disappear, but I don't think that is really necessary, since you are already dealing with an approximation when using the discrete cross-correlation. Below is the graph of the correlation, so you can see that the values are indeed close:
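One way to check whether the discrepancy comes from rounding or from the raw sum of products itself is to normalize each lag by the energy of the window of ref it covers. The raw correlation grows with the window's amplitude, so its peak need not sit at the true offset, whereas the normalized value reaches 1 only where the window matches the fragment (Cauchy-Schwarz inequality). This is a sketch added for illustration, not part of the answer above; the variable names are assumptions.

```python
import numpy as np

ref = np.sin(np.linspace(-2 * np.pi, 2 * np.pi, 10000))
fragment = ref[2149:7022]

# Raw sliding sum of products (np.correlate uses mode="valid" by default).
raw = np.correlate(ref, fragment, mode="valid")

# Sum of squares of each window of ref that the fragment is compared against.
window_energy = np.correlate(ref ** 2, np.ones(fragment.size), mode="valid")

# Normalized cross-correlation: cosine similarity between window and fragment.
normalized = raw / np.sqrt(window_energy * np.sum(fragment ** 2))

lag = int(np.argmax(normalized))
print(lag)  # the offset at which the fragment was cut out of ref
```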

Problems with computing the joint probability mass function with np.histogram2d

I currently have a 4024 by 10 array, where column 0 represents the 4024 different returns of stock 1, column 1 the 4024 returns of stock 2, and so on. It is for an assignment for my master's, where I'm asked to compute the entropy and joint entropy of the different random variables (each random variable being the returns of one stock). However, these entropy calculations require both P(x) and P(x,y). So far I've managed to compute the individual empirical probabilities using the following code:
import numpy as np
import pandas as pd

def entropy(ret,t,T,a,n):
    returns=pd.read_excel(ret)
    returns_df=returns.iloc[t:T,:]
    returns_mat=returns_df.to_numpy()  # replaces the deprecated as_matrix()
    asset_returns=returns_mat[:,a]
    hist,bins=np.histogram(asset_returns,bins=n)
    empirical_prob=hist/hist.sum()
    entropy_vector=np.empty(len(empirical_prob))
    for i in range(len(empirical_prob)):
        if empirical_prob[i]==0:
            entropy_vector[i]=0
        else:
            entropy_vector[i]=-empirical_prob[i]*np.log2(empirical_prob[i])
    shannon_entropy=np.sum(entropy_vector)
    return shannon_entropy, empirical_prob
P.S. ignore the whole entropy part of the code
As you can see, I've simply computed the 1d histogram and then divided each count by the total sum of the histogram in order to find the individual probabilities. However, I'm really struggling with how to go about computing P(x,y) using np.histogram2d().
Now, obviously P(x,y)=P(x)*P(y) if the random variables are independent, but in my case they are not, as these stocks belong to the same index and therefore possess some positive correlation, i.e. they're dependent, so taking the product of the two individual probabilities does not hold. I've tried following the suggestions of my professor, where he said:
"We had discussed how to get the empirical pdf for a univariate distribution: one defines the bins and then counts simply how many observations are in the respective bin (relative to the total number of observations). For bivariate distributions you can do the same, but now you make 2-dimensional binning (check for example the histogram2 command in matlab)"
As you can see, he's referring to the 2d histogram function of MATLAB, but I've decided to do this assignment in Python, and so far I've written the following code:
def jointentropy(ret,t,T,a,b,n):
    returns=pd.read_excel(ret)
    returns_df=returns.iloc[t:T,:]
    returns_mat=returns_df.to_numpy()  # replaces the deprecated as_matrix()
    assetA=returns_mat[:,a]
    assetB=returns_mat[:,b]
    hist,bins1,bins2=np.histogram2d(assetA,assetB,bins=n)
But I don't know what to do from here, because np.histogram2d() returns a 4025 by 4025 array as well as the two separate bins, so I don't know what I can do to compute P(x,y) for my two dependent random variables.
I've tried to figure this out for hours without any luck or success, so any kind of help would be highly appreciated! Thank you very much in advance!

Looks like you've got a clear case of conditional or Bayesian probability on your hands. You can look it up, for example, here: http://www.mathgoodies.com/lessons/vol6/dependent_events.html, which gives the probability of both events occurring as P(x,y) = P(x) · P(y|x), where P(y|x) is the "probability of event y given x". This should apply in your situation because, if two stocks are from the same index, one price cannot happen without the other. Just build two separate bins like you did for one and calculate probabilities as above.
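Following the professor's hint quoted in the question, the empirical joint probabilities can also be read directly from the 2-d histogram by dividing each cell count by the total number of observations, just as in the 1-d case. The sketch below is an illustration under assumptions: the names `joint_pmf` and `entropy_bits` are hypothetical, and synthetic correlated data stands in for the stock returns.

```python
import numpy as np

def joint_pmf(a, b, bins):
    """Empirical P(x, y): 2-d bin counts divided by the number of observations."""
    counts, _, _ = np.histogram2d(a, b, bins=bins)
    return counts / counts.sum()

def entropy_bits(p):
    """Shannon entropy in bits of a probability array; empty bins contribute 0."""
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# Synthetic stand-in for two positively correlated return series.
rng = np.random.default_rng(0)
asset_a = rng.normal(size=4024)
asset_b = 0.6 * asset_a + 0.8 * rng.normal(size=4024)

p_xy = joint_pmf(asset_a, asset_b, bins=10)   # shape (10, 10)
p_x = p_xy.sum(axis=1)                        # marginal P(x)
p_y = p_xy.sum(axis=0)                        # marginal P(y)

h_xy = entropy_bits(p_xy)
print(h_xy, entropy_bits(p_x) + entropy_bits(p_y))
```

With `bins=n` given as an integer, the returned count array has shape (n, n), so the number of bins, not the number of observations, determines its size.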
