First of all, I tried to perform dimensionality reduction on my n_samples x 53 data using scikit-learn's Kernel PCA with a precomputed kernel. The code worked without any issues with 50 samples. However, when I increased the number of samples to 100, I suddenly got the following message.
Process finished with exit code -1073740940 (0xC0000374)
Here's the detail of what I want to do:
I want to obtain the optimal value of the kernel function's hyperparameter for my Kernel PCA function, defined as follows.
from sklearn.decomposition import KernelPCA as drm
from somewhere import costfunction
from somewhere_else import customkernel

def kpcafun(w, X):
    # X is the sample matrix
    # w is the kernel hyperparameter
    n_princomp = 2
    drmodel = drm(n_princomp, kernel='precomputed')
    k_matrix = customkernel(X, X, w)
    transformed_x = drmodel.fit_transform(k_matrix)
    cost = costfunction(transformed_x)
    return cost
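For context, customkernel builds an n_samples x n_samples kernel matrix from two sets of samples and the hyperparameter w; it is roughly of this form (a simplified RBF-style sketch for illustration only, not my exact implementation, which lives in another module):
import numpy as np

def customkernel(X1, X2, w):
    # Hypothetical RBF kernel with a single bandwidth hyperparameter w
    # (illustration only; the real customkernel may differ).
    sq_dists = (
        np.sum(X1 ** 2, axis=1)[:, None]
        + np.sum(X2 ** 2, axis=1)[None, :]
        - 2.0 * X1 @ X2.T
    )
    return np.exp(-w * sq_dists)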
Therefore, to optimize the hyperparams I used the following code.
from scipy.optimize import minimize
# assume that wstart and optimbound are already defined
res = minimize(kpcafun, wstart, method='L-BFGS-B', bounds=optimbound, args=(X,))
The strange thing is that when I stepped through the first 10 iterations of the optimization in the debugger, nothing unusual happened and all of the variable values seemed normal. But when I turned off the breakpoints and let the program continue, the message above appeared without any error traceback.
Does anyone know what might be wrong with my code? Or does anyone have tips for resolving a problem like this?
Thanks
I'm trying to fit a quantile regression model to my input data. I would like to use sklearn, but I am getting a memory allocation error when I try to fit the model. The same data works fine with the equivalent statsmodels function.
The error I get is the following:
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 55.9 GiB for an array with shape (86636, 86636) and data type float64
It doesn't make any sense: my X and y have shapes (86636, 4) and (86636, 1), respectively.
Here's my script:
import pandas as pd
import statsmodels.api as sm
from sklearn.linear_model import QuantileRegressor
training_df = pd.read_csv("/path/to/training_df.csv") # 86,000 rows
FEATURES = [
    "feature_1",
    "feature_2",
    "feature_3",
    "feature_4",
]
TARGET = "target"
# STATSMODELS WORKS FINE WITH 86,000, RUNS IN 2-3 SECONDS.
model_statsmodels = sm.QuantReg(training_df[TARGET], training_df[FEATURES]).fit(q=0.5)
# SKLEARN GIVES A MEMORY ALLOCATION ERROR, OR TAKES MINUTES TO RUN IF I SIGNIFICANTLY TRIM THE DATA TO < 1000 ROWS.
model_sklearn = QuantileRegressor(quantile=0.5, alpha=0)
model_sklearn.fit(training_df[FEATURES], training_df[TARGET])
I've checked the sklearn documentation and I'm pretty sure my inputs are fine as DataFrames; I get the same issue with ndarrays. So I'm not sure what the problem is. Is it possible there's an issue with something under the hood?
[Here][1] is the scikit-learn documentation for QuantileRegressor.
Many thanks for any help / ideas.
[1]: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.QuantileRegressor.html
The sklearn QuantileRegressor class uses linear programming to solve the quantile regression problem, which is much more computationally expensive than the iterative reweighted least squares used by the statsmodels QuantReg class.
Here is a GitHub issue for the same problem: https://github.com/scikit-learn/scikit-learn/issues/22922
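If your scikit-learn version is recent enough, one workaround discussed there is to switch the LP backend to the sparse HiGHS solver via the solver parameter. A sketch, assuming scikit-learn >= 1.0 and scipy >= 1.6, and reusing the names from your script:
from sklearn.linear_model import QuantileRegressor

# "highs" delegates to scipy's sparse HiGHS linear-programming backend instead
# of the dense interior-point solver, which should avoid allocating a huge
# (n_samples, n_samples) float64 array.
model_sklearn = QuantileRegressor(quantile=0.5, alpha=0, solver="highs")
model_sklearn.fit(training_df[FEATURES], training_df[TARGET])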
I have worked through a linear regression problem with the Boston dataset and obtained the following results:
The loss value does not change as the number of iterations increases. What is the reason for this mistake? Please help me.
import pandas as pd
import torch
import numpy as np
import torch.nn as nn
from sklearn import preprocessing

training_set = pd.read_csv('boston_data.csv')
training_set = training_set.to_numpy()
test_set = test_set.to_numpy()

inputs = training_set[:, 0:13]
inputs = preprocessing.normalize(inputs)
target = training_set[:, 13:14]
target = preprocessing.normalize(target)

inputs = torch.from_numpy(inputs)
target = torch.from_numpy(target)
test_set = torch.from_numpy(test_set)

w = torch.randn(13, 1, requires_grad=True)
b = torch.randn(404, 1, requires_grad=True)

def model(x):
    return x @ w + b

pred = model(inputs.float())

def loss_MSE(x, y):
    ras = x - y
    return torch.sum(ras * ras) / ras.numel()

for i in range(100):
    pred = model(inputs.float())
    loss = loss_MSE(target, pred)
    loss.backward()
    with torch.no_grad():
        w -= w.grad * 1e-5
        b -= b.grad * 1e-5
        w.grad.zero_()
        b.grad.zero_()
    print(loss)
Welcome to Stack Overflow!
Your main loop is fine (you could have made your life much easier, however; you should probably read this), but your learning rate (1e-5) is most likely way too low.
I tried with a small dummy problem: it was solved very quickly with a learning rate of ~1e-2, and would take tremendously longer with 1e-5. It does converge eventually, but only after far more than 100 epochs. You mentioned you tried increasing the number of epochs, but did not say how many you actually ran.
Please try increasing this parameter (the learning rate) to see whether it solves your issue; a sketch is below. You can also try removing the division by numel(), which will have the same effect (the division is also applied to the gradients).
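For example, here is a minimal sketch of the same fit using nn.Linear and torch.optim.SGD with a larger learning rate (1e-2); it assumes the inputs and target tensors built in your snippet, and the values are illustrative rather than tuned:
import torch
import torch.nn as nn

model = nn.Linear(13, 1)                                   # replaces the manual w and b
criterion = nn.MSELoss()                                   # same mean-squared-error objective
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)   # larger step size than 1e-5

for epoch in range(100):
    pred = model(inputs.float())                           # inputs/target from your code above
    loss = criterion(pred, target.float())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if epoch % 10 == 0:
        print(epoch, loss.item())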
Next time, please provide a small, minimal example that can be run to reproduce your error. Here, most of your code is data loading, which could be replaced with two lines of dummy data generation.
I am new to Python, so I am not sure if this problem is due to my inexperience or whether this is a glitch.
I am running this code multiple times on the same data (no random number generation) and getting different results. This has occurred with more than one variable so far, and obviously I cannot proceed with the analysis until I figure out which results are trustworthy. Here is a short sample of the results I have obtained after running the code four times. Why is there such a discrepancy between these outputs? I am puzzled and greatly appreciate your advice.
Linear Regression
from scipy.stats import linregress
import scipy.stats
from scipy.signal import welch
import matplotlib
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import scipy.signal as signal
part_022_o = pd.read_excel(r'C:\Users\Me\Desktop\Behavioral Data Processed\part_022_combined_other.xlsx')
distance_o = part_022_o["distance"]
fs = 200
f, Pwelch_spec = signal.welch(distance_o, fs=fs, window='hanning',nperseg=400, noverlap=200, scaling='density', average='mean')
log_f = np.log(f, where=f>0)
log_pwelch = np.log(Pwelch_spec, where=Pwelch_spec>0)
idx = np.isfinite(log_f) & np.isfinite(log_pwelch)
polynomial_coefficients = np.polyfit(log_f[idx],log_pwelch[idx],1)
print(polynomial_coefficients)
scipy.stats.linregress(log_f[idx], log_pwelch[idx])
Results First Attempt
[ 0.00324568 -2.82962602]
Results Second Attempt
[-2.70137164 6.97117509]
Results Third Attempt
[-2.70137164 6.97117509]
Results Fourth Attempt
[-2.28028005 5.53839502]
The same thing happens when I use scipy.stats.linregress().
Thank you,
Confused
Edit: full code added.
Also, the issue appears to be related to np.log(), since only the values of the "log_f" array seem to change between runs. It is hard to be certain that nothing else is changing (e.g. log_pwelch), but the differences in output clearly correspond to differences in the first value of the "log_f" array.
Edit: I have narrowed the issue down to np.log(f, where=f>0). The first value in the f array is zero. According to the numpy log documentation, "...Note that if an uninitialized out array is created via the default out=None, locations within it where the condition is False will remain uninitialized." Apparently this means that the value is unpredictable and can vary from run to run, which is exactly what I am observing. Given my inexperience with Python, I am not sure what the best solution is (e.g. specifying the out array in the log function, using a random seed, or only keeping the regression coefficients when the zero value is left unchanged after the log, etc.).
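Based on that, one deterministic fix seems to be to pass an explicitly initialized out array so the masked positions hold NaN instead of leftover memory; a sketch using the variable names from my code above:
# Initialize the output arrays with NaN so positions where the condition is
# False are deterministic rather than uninitialized memory.
log_f = np.log(f, out=np.full_like(f, np.nan), where=f > 0)
log_pwelch = np.log(Pwelch_spec, out=np.full_like(Pwelch_spec, np.nan), where=Pwelch_spec > 0)
idx = np.isfinite(log_f) & np.isfinite(log_pwelch)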
Try to use a random seed to reproduce results. Do this with the following code at the top of your program:
import numpy as np
np.random.seed(123)  # or any number you want
See here for more info: https://docs.scipy.org/doc/numpy-1.15.1/reference/generated/numpy.random.seed.html
A random seed ensures you get repeatable results when some part of your program is generating numbers at random.
Try finding out what the functions (np.polyfit(), np.log()) are actually doing by reading their documentation.
Using a seed value is standard practice for scikit-learn and ML in general.
I am trying to run a t-SNE analysis on a square distance matrix. These are the commands I am using.
from sklearn.manifold import TSNE
model = TSNE(n_components=2, perplexity=32, verbose=10, n_iter=1000, metric="precomputed")
embeddings = model.fit_transform(D)
This is the output I receive: [screenshot of the verbose t-SNE log]
It looks like the program runs through 75 iterations, then calls it good and quits. When I plot the data from the t-SNE, it's basically just a single dense blob. Why is the program quitting early, and how can I make it run longer?
It's quitting because the exit condition is reached.
Interpreting the log, the exit condition is probably a metric on the gradient, called the gradient norm here. If needed, check out the basics of gradient descent to understand the intuition. Since every iteration takes a step in the direction of the negative gradient, tiny gradients will not do much to the objective (and will be interpreted as: we have found a local/global minimum).
It looks like (still interpreting your log only):
if np.linalg.norm(gradient) < 1e-4:
    return solution
There is no merit in doing further iterations for this parameterization of the optimization problem. The solution won't get better (in terms of minimization).
You can only try other parameters (resulting in other optimization problems), as sketched below.
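If you do want to push the optimizer further, you can make the exit condition stricter and raise the iteration cap when constructing the estimator. A sketch assuming scikit-learn's TSNE; the parameter values are illustrative, not recommendations:
from sklearn.manifold import TSNE

model = TSNE(
    n_components=2,
    perplexity=32,
    learning_rate=200.0,   # changes the optimization problem itself
    n_iter=5000,           # raise the iteration cap
    min_grad_norm=1e-10,   # lower the gradient-norm exit threshold
    metric="precomputed",
    init="random",         # init="pca" is not allowed with precomputed distances
    verbose=10,
)
embeddings = model.fit_transform(D)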
So this sounds a bit convoluted, but I've had a problem with joblib lately where it will create a bunch of processes and then just hang (i.e., each process takes up memory but uses no CPU time).
Here is the simplest code I've got that will reproduce the problem:
from sklearn import linear_model
import numpy as np
from sklearn import cross_validation as cval
from joblib import Parallel, delayed

def fit_hanging_model(n=10000, nx=10, ny=32, ndelay=10,
                      n_cvs=5, n_jobs=None):
    # Create data
    X = np.random.randn(n, ny*ndelay)
    y = np.random.randn(n, nx)

    # Create model + CV
    model = linear_model.Ridge(alpha=1000.)
    cvest = cval.KFold(n, n_folds=n_cvs, shuffle=True)

    # Fit model
    par = Parallel(n_jobs=n_jobs, verbose=10)
    parfunc = delayed(_fit_model_cvs)
    par(parfunc(X, y, train, test, model)
        for i, (train, test) in enumerate(cvest))

def _fit_model_cvs(X, Y, train, test, model):
    model.fit(X, Y)

n = 10
a = np.random.randn(n, 32)
b = np.random.randn(32, 10)

##########
c = np.dot(a, b)
##########

fit_hanging_model(n_jobs=3)
Here is what happens:
If I run all of the code above, then it spawns off three processes and hangs
If I run all of the code above, but use n_jobs=1, then it works fine
If I run all of the code above a second time, after running it once with n_jobs=1, then it works fine no matter how many jobs I use.
If I run all of the code above EXCEPT for the code between the ######, then it runs fine.
However, if I then run the code between the ######, and try to run fit_hanging_model with n_jobs > 1, then it hangs
This is with joblib = 0.8.0, and sklearn 0.15-git.
Note, this bug is on CentOS Linux. I have not been able to reproduce it on another machine, so it may be hard to reproduce.
Does anyone have any idea why this might be going on? It seems like that dot product is doing something strange, but I have no idea what it could be...I'm at the end of my rope...
Figured it out. Apparently this was an issue with joblib creating multiple Python processes while MKL was simultaneously trying to do threading. See the issue and the solution (which involves setting environment variables) here:
https://github.com/joblib/joblib/issues/138
In my experience, joblib hangs when the function it calls uses another level of parallelization.
Following the solution in the answer by @choldgraf, you need to disable the inner level of parallelization used by MKL:
import os
os.environ['MKL_NUM_THREADS'] = '1'
os.environ['OMP_NUM_THREADS'] = '1'
os.environ['MKL_DYNAMIC'] = 'FALSE'
This applies to other parallel computing libraries such as OpenMP.
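Note that these environment variables generally have to be set before numpy (and hence MKL/OpenMP) is first imported in the process; alternatively, if the threadpoolctl package is available, you can cap the inner threading at runtime. A sketch, not tested against the exact versions above:
import os

# Set the limits before numpy / scikit-learn are imported, so MKL and OpenMP
# pick them up when they initialize.
os.environ['MKL_NUM_THREADS'] = '1'
os.environ['OMP_NUM_THREADS'] = '1'
os.environ['MKL_DYNAMIC'] = 'FALSE'

import numpy as np

# Runtime alternative: cap the BLAS/OpenMP thread pools with threadpoolctl.
from threadpoolctl import threadpool_limits

with threadpool_limits(limits=1):
    c = np.dot(np.random.randn(10, 32), np.random.randn(32, 10))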