I am new to Python, so I am not sure if this problem is due to my inexperience or whether this is a glitch.
I am running this code multiple times on the same data (no random number generation) and getting different results. This has occurred with more than one variable so far, and obviously I cannot proceed with the analysis until I figure out which results are trustworthy. Here is a short sample of the results I have obtained after running the code four times. Why is there such a discrepancy between these outputs? I am puzzled and greatly appreciate your advice.
Linear Regression
from scipy.stats import linregress
import scipy.stats
from scipy.signal import welch
import matplotlib
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import scipy.signal as signal
part_022_o = pd.read_excel(r'C:\Users\Me\Desktop\Behavioral Data Processed\part_022_combined_other.xlsx')
distance_o = part_022_o["distance"]
fs = 200
f, Pwelch_spec = signal.welch(distance_o, fs=fs, window='hanning', nperseg=400, noverlap=200, scaling='density', average='mean')
log_f = np.log(f, where=f>0)
log_pwelch = np.log(Pwelch_spec, where=Pwelch_spec>0)
idx = np.isfinite(log_f) & np.isfinite(log_pwelch)
polynomial_coefficients = np.polyfit(log_f[idx], log_pwelch[idx], 1)
print(polynomial_coefficients)
scipy.stats.linregress(log_f[idx], log_pwelch[idx])
Results First Attempt
[ 0.00324568 -2.82962602]
Results Second Attempt
[-2.70137164 6.97117509]
Results Third Attempt
[-2.70137164 6.97117509]
Results Fourth Attempt
[-2.28028005 5.53839502]
The same thing happens when I use scipy.stats.linregress().
Thank you,
Confused
Edit: full code added.
Also, the issue appears to be related to np.log(), since only the values of the "log_f" array seem to change between outputs. It is hard to be certain that nothing else is changing (e.g. log_pwelch), but the differences in output clearly correspond to differences in the first value of the "log_f" array.
Edit: I have narrowed the issue down to np.log(f, where=f>0). The first value in the f array is zero. According to the numpy.log documentation, "...Note that if an uninitialized out array is created via the default out=None, locations within it where the condition is False will remain uninitialized." Apparently this means that the value is unpredictable and can vary from run to run, which is exactly what I am observing. Given my inexperience with Python, I am not sure what the best solution is (e.g. specifying the out array in the log function, using a random seed, or only keeping the regression coefficients from runs where the zero passes through the log unchanged).
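For reference, a minimal sketch of the first two options, based on my reading of the numpy docs (this is my assumption about how the fix would look, reusing the f and Pwelch_spec arrays from the code above):
# Option 1: give np.log an explicitly initialized out array, so positions
# where the condition is False become NaN instead of uninitialized memory.
log_f = np.log(f, out=np.full_like(f, np.nan), where=f > 0)
log_pwelch = np.log(Pwelch_spec, out=np.full_like(Pwelch_spec, np.nan), where=Pwelch_spec > 0)

# Option 2: drop the zero bins first and take the log of the rest.
mask = (f > 0) & (Pwelch_spec > 0)
log_f = np.log(f[mask])
log_pwelch = np.log(Pwelch_spec[mask])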
Try using a random seed to reproduce your results. Do this with the following code at the top of your program:
import numpy as np
np.random.seed(123)  # or any number you want
See here for more info: https://docs.scipy.org/doc/numpy-1.15.1/reference/generated/numpy.random.seed.html
A random seed ensures you get repeatable results when some part of your program generates numbers at random.
Also try reading the documentation to find out what the functions you are calling (np.polyfit(), np.log()) are actually doing.
Using a seed value is standard practice in scikit-learn and machine learning more generally.
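For example, a minimal sketch of what seeding does, assuming plain NumPy random draws:
import numpy as np

np.random.seed(123)
print(np.random.rand(3))  # the same three numbers on every run

# Note: the seed only controls pseudo-random number generation; it does not
# make uninitialized memory (such as the default out=None array left behind
# by np.log(..., where=...)) reproducible.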
Related
I tried to perform dimensionality reduction on my n_samples x 53 data using scikit-learn's Kernel PCA with a precomputed kernel. The code worked without any issues when I first tried it on 50 samples. However, when I increased the number of samples to 100, I suddenly got the following message.
Process finished with exit code -1073740940 (0xC0000374)
Here is what I want to do in detail:
I want to obtain the optimal value of the kernel hyperparameter for my Kernel PCA function, defined as follows.
from sklearn.decomposition import KernelPCA as drm
from somewhere import costfunction
from somewhere_else import customkernel

def kpcafun(w, X):
    # X is the sample matrix, w is the kernel hyperparameter
    n_princomp = 2
    drmodel = drm(n_princomp, kernel='precomputed')
    k_matrix = customkernel(X, X, w)
    transformed_x = drmodel.fit_transform(k_matrix)
    cost = costfunction(transformed_x)
    return cost
To optimize the hyperparameter, I used the following code.
from scipy.optimize import minimize
# assume that wstart and optimbound are already defined
res = minimize(kpcafun, wstart, method='L-BFGS-B', bounds=optimbound, args=(X,))
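For reference, here is a self-contained sketch of the same workflow on synthetic data of the same shape (100 x 53). I have substituted an RBF kernel for customkernel and a simple variance-based cost for costfunction, purely as placeholders, not my real functions:
import numpy as np
from scipy.optimize import minimize
from sklearn.decomposition import KernelPCA

def rbf_kernel(X, Y, width):
    # placeholder kernel: squared-exponential with a width hyperparameter
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * width ** 2))

def kpca_cost(w, X):
    width = np.asarray(w).ravel()[0]  # the optimizer passes w as a 1-element array
    K = rbf_kernel(X, X, width)
    transformed = KernelPCA(n_components=2, kernel='precomputed').fit_transform(K)
    return -transformed.var()  # placeholder cost: spread of the embedding

X = np.random.RandomState(0).rand(100, 53)
res = minimize(kpca_cost, x0=np.array([1.0]), args=(X,),
               method='L-BFGS-B', bounds=[(1e-3, 10.0)])
print(res.x)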
The strange thing is that when I tried to debug the first 10 iterations of the optimization process, nothing strange happened; all the variable values looked normal. But when I turned off the breakpoints and let the program continue, the message appeared without any error notification.
Does anyone know what might be wrong with my code? Or does anyone have tips for resolving a problem like this?
Thanks
I am implementing Andrew Ng's Machine Learning course in Python, but I am stuck because scipy's optimize functions keep giving me a hard time by either not working or giving me dimension errors.
The goal is to find the minimum of the cost function, a scalar function that takes theta (dimension (1, 401)), X (dimension (5000, 401)), and y (dimension (5000, 1)) as inputs. I have defined this cost function and its gradient with respect to the parameters. When running one of the optimize functions (I have tried fmin_tnc, minimize, Nelder-Mead, and others, all not working), they either run for ages or keep giving me errors saying that the array dimensions are wrong or that a division by zero was encountered... errors that I am not able to spot.
The weirdest thing is that this problem first popped up when I was doing exercise 2 on logistic regression, and then magically disappeared without me changing anything. Now, implementing multi-class logistic regression, it has appeared again, and it won't go away even though I have literally copied and pasted the code from exercise 2!
The code is the following:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.io import loadmat
import scipy.misc
import matplotlib.cm as cm
from scipy.optimize import minimize,fmin_tnc
import random
def sigmoid(z):
    return 1/(1+np.exp(-z))

def J(theta, X, y):
    theta_t = np.transpose(theta)
    prod = np.matmul(X, theta_t)
    sigm = sigmoid(prod)
    vec = y*np.log(sigm) + (1-y)*np.log(1-sigm)
    return -np.sum(vec)/len(y)

def grad(theta, X, y):
    theta_t = np.transpose(theta)
    prod = np.matmul(X, theta_t)
    sigm = sigmoid(prod)
    one = sigm - y
    return np.matmul(np.transpose(one), X)/len(y)
data=loadmat('/home/marco/Desktop/MLang/mlex3/ex3/ex3data1.mat')
X,y = data['X'],data['y']
X=np.column_stack((np.ones(len(X[:,0])),X))
initial_theta=np.zeros((1,len(X[0,:])))
res=fmin_tnc(func=J, x0=initial_theta.flatten(), args=(X,y.flatten()), fprime=grad)
theta_opt=res[0]
Instead of returning the value of theta that minimizes the function as theta_opt, it says:
/home/marco/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:8: RuntimeWarning: divide by zero encountered in log
I have no clue where this divide by zero occurs, given that there is literally no division in the whole code, except for the division by len(y), which is 5000, and the division in the sigmoid function (1/(1+exp(-z))), whose denominator can never be 0!
Any suggestions?
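One place a log can hit zero even without an explicit division: for large |z| the sigmoid saturates to exactly 0.0 or 1.0 in float64, and np.log(0) is what raises the "divide by zero encountered in log" warning. A minimal sketch; the clipping at the end is just one common workaround, not necessarily the right fix here:
import numpy as np

def sigmoid(z):
    return 1/(1+np.exp(-z))

z = np.array([40.0])
s = sigmoid(z)
print(s)              # [1.] -- saturated to exactly 1.0 in float64
print(np.log(1 - s))  # [-inf] plus "divide by zero encountered in log"

# Common workaround: clip the sigmoid output away from 0 and 1 before the log.
eps = 1e-12
s_clipped = np.clip(s, eps, 1 - eps)
print(np.log(s_clipped), np.log(1 - s_clipped))  # finite values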
I have a problem with a GARCH model in Python. My code looks as follows:
import sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from arch import arch_model
sys.setrecursionlimit(1800)
spotmarket = pd.read_excel("./data/external/Spotmarket.xlsx", index=True)
l = spotmarket['Price'].pct_change().dropna()
returns = 100 * l
returns.plot()
plt.show()
model=arch_model(returns, vol='Garch', p=1, o=0, q=1, dist='Normal')
results=model.fit()
print(results.summary())
The first part of the code works well. I have end-of-day prices in a separate Excel table and want to model them with a GARCH model. The problem is that I get the error message The optimizer returned code 9. The message is:
Iteration limit exceeded
See scipy.optimize.fmin_slsqp for code meaning.
Does anyone have an idea how I can handle the problem with the iteration limit? Thank you!
Reading the source code (here), you can pass additional parameters to the fit method. Internally, scipy.optimize.minimize (doc) is called, and the parameters of interest to you are probably maxiter and ftol.
Try manually changing the default values (maxiter=100 and ftol=1e-06) to new ones that might lead to convergence. Example:
results = model.fit(options={'maxiter': 200})
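If raising the iteration limit alone does not help, both options can be combined; the key names here are the ones scipy's SLSQP optimizer accepts, assuming your arch version forwards the dict to scipy unchanged:
results = model.fit(options={'maxiter': 500, 'ftol': 1e-08})
print(results.summary())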
I'm getting ValueError: Linkage 'Z' uses the same cluster more than once when trying to get flat clusters in Python with scipy.cluster.hierarchy.fcluster. This error happens only sometimes, usually with really big matrices, i.e. 10000x10000.
import scipy.cluster.hierarchy as sch
Z = sch.linkage(d, method="ward")
# some computation here, returning n (usually between 5-30)
clusters = sch.fcluster(Z, t=n, criterion='maxclust')
Why does it happen? How can I avoid it? Unfortunately I couldn't find any useful info by googling...
EDIT: The error also occurs when trying to get a dendrogram.
No such error appears if method='average' is used.
It seems that using fastcluster instead of scipy.cluster.hierarchy solves the problem. In addition, the fastcluster implementation is slightly faster than scipy's.
For more details, have a look at the paper.
import fastcluster
import scipy.cluster.hierarchy as sch

Z = fastcluster.linkage(d, method="ward")
# some computation here, returning n (usually between 5-30)
clusters = sch.fcluster(Z, t=n, criterion='maxclust')
I'm fairly new to programming, but this problem happens in Python and in Excel as well.
I'm using the following formulas for the RC transfer function
s/(s+1) for High Pass
1/(s+1) for Low Pass
with s = jwRC
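(For reference: at the cutoff frequency w = 1/(RC), s = j, so both |s/(s+1)| and |1/(s+1)| equal 1/sqrt(2) ≈ 0.707, which is why the two curves should cross at that height.)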
Below is the code I used in Python.
from pylab import *
from numpy import *
from cmath import *
"""
Generating a transfer function for RC filters.
Importing modules for complex math and plotting.
"""
f = arange(1, 5000, 1)
w = 2.0j*pi*f
R=100
C=1E-5
hp_tf = (w*R*C)/(w*R*C+1) # High Pass Transfer function
lp_tf = 1/(w*R*C+1) # Low Pass Transfer function
plot(f, hp_tf) # plot high pass transfer function
plot(f, lp_tf, '-r') # plot low pass transfer function
xscale('log')
I can't post images yet, so I can't show the plot. But the issue is that the cutoff frequency is different for each filter: they should cross at y = 0.707, but they actually cross at about 0.5.
I figure my formula or method is wrong somewhere, but I can't find the mistake. Can anyone help me out?
Also, on a related note, I tried to convert to the dB scale and I got the following error:
TypeError: only length-1 arrays can be converted to Python scalars
I'm using the following:
debl=20*log(hp_tf)
This is a classic example of why you should avoid pylab and, more generally, imports of the form
from module import *
unless you know exactly what they do, since they hopelessly clutter the namespace.
Using
import matplotlib.pyplot as plt
import numpy as np
and then calling np.log and plt.plot etc. will solve your problem.
Further explanation
What's happening here is that
from pylab import *
brings in numpy's log function, which operates on arrays (the one you want).
However, the later import,
from cmath import *
overwrites it with a version that only accepts scalars, hence your error.
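A minimal sketch of the namespaced version; note that the dB conversion conventionally uses log10 of the magnitude, and that you generally want to plot abs() of a complex transfer function:
import numpy as np
import matplotlib.pyplot as plt

f = np.arange(1, 5000, 1)
w = 2.0j * np.pi * f
R = 100
C = 1e-5
hp_tf = (w*R*C) / (w*R*C + 1)  # high-pass transfer function
lp_tf = 1 / (w*R*C + 1)        # low-pass transfer function

plt.plot(f, np.abs(hp_tf))        # plot the magnitude, not the complex values
plt.plot(f, np.abs(lp_tf), '-r')
plt.xscale('log')

db_hp = 20 * np.log10(np.abs(hp_tf))  # dB scale: 20*log10 of the magnitude
plt.figure()
plt.plot(f, db_hp)
plt.xscale('log')
plt.show()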