Applying SVD throws a MemoryError instantaneously? - python

I am trying to apply SVD on my matrix (3241 x 12596) that was obtained after some text processing (with the ultimate goal of performing Latent Semantic Analysis) and I am unable to understand why this is happening as my 64-bit machine has 16GB RAM. The moment svd(self.A) is called, it throws an error. The precise error is given below:
Traceback (most recent call last):
File ".\SVD.py", line 985, in <module>
_svd.calc()
File ".\SVD.py", line 534, in calc
self.U, self.S, self.Vt = svd(self.A)
File "C:\Python26\lib\site-packages\scipy\linalg\decomp_svd.py", line 81, in svd
overwrite_a = overwrite_a)
MemoryError
So I tried using
self.U, self.S, self.Vt = svd(self.A, full_matrices= False)
and this time, it throws the following error:
Traceback (most recent call last):
File ".\SVD.py", line 985, in <module>
_svd.calc()
File ".\SVD.py", line 534, in calc
self.U, self.S, self.Vt = svd(self.A, full_matrices= False)
File "C:\Python26\lib\site-packages\scipy\linalg\decomp_svd.py", line 71, in svd
return numpy.linalg.svd(a, full_matrices=0, compute_uv=compute_uv)
File "C:\Python26\lib\site-packages\numpy\linalg\linalg.py", line 1317, in svd
work = zeros((lwork,), t)
MemoryError
Is this matrix really so large that NumPy cannot handle it, and is there something I can do at this stage without changing the methodology itself?

Yes, the full_matrices parameter to scipy.linalg.svd is important: your input is highly rank-deficient (rank at most 3,241), so you don't want to allocate the entire 12,596 x 12,596 matrix for V!
More importantly, matrices coming from text processing are likely to be very sparse. scipy.linalg.svd works on dense matrices and doesn't offer a truncated SVD, which results in (a) tragic performance and (b) lots of wasted memory.
Have a look at the sparsesvd package from PyPI, which works on sparse input and lets you ask for the top K factors only. Or try scipy.sparse.linalg.svds, though that's not as efficient and is only available in newer versions of scipy.
Or, to avoid the gritty details completely, use a package that does efficient LSA for you transparently, such as gensim.
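For concreteness, here is a minimal sketch of the sparse, truncated approach using scipy.sparse.linalg.svds. The matrix is a random stand-in with the question's shape, and k=300 is an assumed number of LSA dimensions, not something from the original post:
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import svds

# Random stand-in for the 3241 x 12596 term-document matrix (~0.1% non-zeros)
A = sp.random(3241, 12596, density=0.001, format='csc', dtype=np.float64)

k = 300  # assumed number of latent dimensions to keep for LSA
U, S, Vt = svds(A, k=k)  # computes only the top-k singular triplets

# svds returns singular values in ascending order; flip to the usual convention
order = np.argsort(S)[::-1]
U, S, Vt = U[:, order], S[order], Vt[order, :]
This keeps memory proportional to the number of non-zeros plus the k factors, instead of a dense m x n input and an n x n matrix for V.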

As it turns out, thanks to Ferdinand Beyer, I had not noticed that I was using a 32-bit version of Python on my 64-bit machine.
Using a 64-bit version of Python and reinstalling all the libraries solved the problem.
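As a quick sanity check (my addition, not part of the original answer), you can confirm the interpreter's bitness directly before reinstalling anything:
import struct
import platform

# Prints 32 for a 32-bit interpreter, 64 for a 64-bit one
print(struct.calcsize("P") * 8)
print(platform.architecture()[0])  # e.g. '32bit' or '64bit'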

Related

Unable to load a pretrained model

After training my model for almost 2 days 3 files were generated:
best_model.ckpt.data-00000-of-00001
best_model.ckpt.index
best_model.ckpt.meta
where best_model is my model name.
When I try to import my model using the following command
with tf.Session() as sess:
    saver = tf.train.import_meta_graph('best_model.ckpt.meta')
    saver.restore(sess, "best_model.ckpt")
I get the following error
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "/home/shreyash/.local/lib/python2.7/site-
packages/tensorflow/python/training/saver.py", line 1577, in
import_meta_graph
**kwargs)
File "/home/shreyash/.local/lib/python2.7/site-
packages/tensorflow/python/framework/meta_graph.py", line 498, in import_scoped_meta_graph
producer_op_list=producer_op_list)
File "/home/shreyash/.local/lib/python2.7/site-packages/tensorflow/python/framework/importer.py", line 259, in import_graph_def
raise ValueError('No op named %s in defined operations.' % node.op)
ValueError: No op named attn_add_fun_f32f32f32 in defined operations.
How to fix this?
I have referred to this post: TensorFlow, why there are 3 files after saving the model?
Tensorflow version 1.0.0 installed using pip
Linux version 16.04
python 2.7
The importer can't find a very specific function in your graph, namely attn_add_fun_f32f32f32, which is likely to be one of the attention functions.
You have probably run into this issue. However, they say it's bundled in TensorFlow 1.0. Double-check that the installed TensorFlow version contains attention_decoder_fn.py (or, if you are using another library, check that it's there).
If it's there, here are your options:
Rename this operation, if possible. You might want to read this discussion for workarounds.
Duplicate your graph definition in code, so that you don't have to call import_meta_graph at all and can instead restore the model's variables into the current graph (see the sketch below).
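To illustrate the second option, here is a rough TF 1.x sketch. It assumes you can rebuild the graph with the same code that produced the checkpoint; build_model is a placeholder for your own model-construction function, not an API from the question:
import tensorflow as tf

graph = tf.Graph()
with graph.as_default():
    # Placeholder: recreate the ops (including the attention op) with your own code,
    # using the same variable names as in training
    model = build_model()
    saver = tf.train.Saver()  # restores variables by name, no meta graph required

with tf.Session(graph=graph) as sess:
    saver.restore(sess, "best_model.ckpt")
    # ... run inference with the rebuilt model here
Because no meta graph is imported, the serialized graph def (and the unknown op name inside it) never has to be parsed; the attention op must of course still be constructible from the code you run.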

"index N is out of bounds for axis 0 with size N" when running Parallel KMeans whereas sequential KMeans works fine

I'm trying to run KMeans using scikit-learn implementation in parallel, but I keep getting the following error message:
Traceback (most recent call last):
File "run_kmeans.py", line 114, in <module>
kmeans = KMeans(n_clusters=2048, n_jobs=-1).fit(descriptors)
File "/usr/local/lib/python2.7/dist-packages/sklearn/cluster/k_means_.py", line 889, in fit
return_n_iter=True)
File "/usr/local/lib/python2.7/dist-packages/sklearn/cluster/k_means_.py", line 362, in k_means
for seed in seeds)
File "/usr/local/lib/python2.7/dist-packages/sklearn/externals/joblib/parallel.py", line 768, in __call__
self.retrieve()
File "/usr/local/lib/python2.7/dist-packages/sklearn/externals/joblib/parallel.py", line 719, in retrieve
raise exception
sklearn.externals.joblib.my_exceptions.JoblibIndexError: JoblibIndexError
_________________________________________________________________________
Multiprocessing exception:
..........................................................................
IndexError: index 11683 is out of bounds for axis 0 with size 11683
When I run KMeans with n_jobs=1, i.e. in a sequential manner, I get no errors and everything works just fine. But with n_jobs=-1 I keep getting the error.
Here's the code I use:
kmeans = KMeans(n_clusters=2048, n_jobs=-1).fit(descriptors)
descriptors is a numpy array with shape (11683, 128).
Am I doing something wrong or is it a bug in KMeans implementation?
What should I do about it (e.g. use MiniBatchKMeans, etc.)?
PS: I'm running it on an Ubuntu 16.04 64-bit machine with 4 GB of RAM and an Intel Core i7-4700HQ @ 2.40GHz.
This problem can be fixed by converting the input data to float64 before fitting, e.g. descriptors.astype(np.float64).
https://github.com/scikit-learn/scikit-learn/issues/8583
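A minimal sketch of that workaround; the descriptors array below is a random stand-in with the question's shape, and note that KMeans accepted n_jobs in the scikit-learn versions of that era (the parameter has since been removed):
import numpy as np
from sklearn.cluster import KMeans, MiniBatchKMeans

# Random stand-in for the real (11683, 128) float32 descriptor array
descriptors = np.random.rand(11683, 128).astype(np.float32)

# Casting to float64 avoids the out-of-bounds IndexError in the parallel code path
descriptors64 = descriptors.astype(np.float64)
kmeans = KMeans(n_clusters=2048, n_jobs=-1).fit(descriptors64)

# MiniBatchKMeans (mentioned in the question) is a much lighter alternative for 2048 clusters
mbk = MiniBatchKMeans(n_clusters=2048, batch_size=1000).fit(descriptors64)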

Flushing memmap completely to disk [duplicate]

I have a huge dataset on which I wish to perform PCA. I am limited by RAM and the computational efficiency of PCA.
Therefore, I shifted to using Iterative PCA.
Dataset size: (140000, 3504)
The documentation states that "This algorithm has constant memory complexity, on the order of batch_size, enabling use of np.memmap files without loading the entire file into memory."
This is really good, but I am unsure how to take advantage of it.
I tried loading one memmap, hoping it would be accessed in chunks, but my RAM blew up.
My code below ends up using a lot of RAM:
ut = np.memmap('my_array.mmap', dtype=np.float16, mode='w+', shape=(140000,3504))
clf=IncrementalPCA(copy=False)
X_train=clf.fit_transform(ut)
When I say "my RAM blew", the Traceback I see is:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python27\lib\site-packages\sklearn\base.py", line 433, in fit_transfo
rm
return self.fit(X, **fit_params).transform(X)
File "C:\Python27\lib\site-packages\sklearn\decomposition\incremental_pca.py",
line 171, in fit
X = check_array(X, dtype=np.float)
File "C:\Python27\lib\site-packages\sklearn\utils\validation.py", line 347, in
check_array
array = np.array(array, dtype=dtype, order=order, copy=copy)
MemoryError
How can I improve this without compromising accuracy by reducing the batch size?
My ideas to diagnose:
I looked at the sklearn source code, and in the fit() function (source code link) I can see the following. This makes sense to me, but I am still unsure about what is wrong in my case.
for batch in gen_batches(n_samples, self.batch_size_):
    self.partial_fit(X[batch])
return self
Edit:
Worst case, I will have to write my own code for iterative PCA that batch-processes by reading and closing .npy files. But that would defeat the purpose of taking advantage of the mechanism that is already there.
Edit2:
If I could somehow delete batches of the memmap file once they have been processed, it would make much more sense.
Edit3:
Ideally, if IncrementalPCA.fit() is just using batches, it should not crash my RAM. I am posting the whole code, just to make sure that I am not making a mistake in flushing the memmap completely to disk beforehand.
temp_train_data = X_train[1000:]
temp_labels = y[1000:]
out = np.empty((200001, 3504), np.int64)
for index, row in enumerate(temp_train_data):
    actual_index = index + 1000
    data = X_train[actual_index-1000:actual_index+1].ravel()
    __, cd_i = pywt.dwt(data, 'haar')
    out[index] = cd_i
out.flush()
pca_obj = IncrementalPCA()
clf = pca_obj.fit(out)
Surprisingly, I note that out.flush() doesn't free my memory. Is there a way to use del out to free my memory completely, and then pass a pointer to the file to IncrementalPCA.fit()?
You have hit a problem with sklearn in a 32-bit environment. I presume you are using np.float16 because you're in a 32-bit environment and you need it to be able to create the memmap object without numpy throwing errors.
In a 64-bit environment (tested with Python 3.3 64-bit on Windows), your code just works out of the box. So, if you have a 64-bit computer available, install 64-bit Python along with 64-bit numpy, scipy and scikit-learn and you are good to go.
Unfortunately, if you cannot do this, there is no easy fix. I have raised an issue on GitHub here, but it is not easy to patch. The fundamental problem is that, within the library, if your dtype is float16, a copy of the array into memory is triggered. The details of this are below.
So, I hope you have access to a 64-bit environment with plenty of RAM. If not, you will have to split up your array yourself and batch-process it, a rather larger task...
N.B. It's really good to see you going to the source to diagnose your problem :) However, if you look at the line where the code fails (from the traceback), you will see that the for batch in gen_batches loop you found is never reached.
Detailed diagnosis:
The actual error generated by the OP's code:
import numpy as np
from sklearn.decomposition import IncrementalPCA
ut = np.memmap('my_array.mmap', dtype=np.float16, mode='w+', shape=(140000,3504))
clf=IncrementalPCA(copy=False)
X_train=clf.fit_transform(ut)
is
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python27\lib\site-packages\sklearn\base.py", line 433, in fit_transfo
rm
return self.fit(X, **fit_params).transform(X)
File "C:\Python27\lib\site-packages\sklearn\decomposition\incremental_pca.py",
line 171, in fit
X = check_array(X, dtype=np.float)
File "C:\Python27\lib\site-packages\sklearn\utils\validation.py", line 347, in
check_array
array = np.array(array, dtype=dtype, order=order, copy=copy)
MemoryError
The call to check_array (code link) uses dtype=np.float, but the original array has dtype=np.float16. Even though the check_array() function defaults to copy=False and passes this on to np.array(), it is ignored (as per the docs) in order to satisfy the requirement that the dtype be changed; therefore a copy is made by np.array.
This could be solved in the IncrementalPCA code by ensuring that the dtype is preserved for arrays with dtype in (np.float16, np.float32, np.float64). However, when I tried that patch, it only pushed the MemoryError further along the chain of execution.
The same copying problem occurs when the code calls linalg.svd() from the main scipy code, and this time the error occurs during a call to gesdd(), a wrapped native LAPACK function. Thus, I do not think there is a way to patch this (at least not an easy way; it would at minimum require altering code in core scipy).
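The dtype-driven copy is easy to reproduce in isolation. A small sketch (my own, with a toy shape instead of the 140000 x 3504 memmap) that mirrors what check_array effectively does:
import numpy as np

# Toy memmap standing in for the question's float16 array
mm = np.memmap('toy.mmap', dtype=np.float16, mode='w+', shape=(1000, 100))

# A copy is only avoided when dtype and layout already match; requesting float64
# from a float16 memmap forces allocation of a fresh in-memory ndarray
arr = np.asarray(mm, dtype=np.float64)

print(type(mm).__name__, type(arr).__name__)  # memmap vs ndarray
print(arr.base is None)                       # True: arr owns its own copied data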
Does the following alone trigger the crash?
X_train_mmap = np.memmap('my_array.mmap', dtype=np.float16,
mode='w+', shape=(n_samples, n_features))
clf = IncrementalPCA(n_components=50).fit(X_train_mmap)
If not, then you can use that model to transform (project) your data iteratively into a smaller representation, using batches:
X_projected_mmap = np.memmap('my_result_array.mmap', dtype=np.float16,
                             mode='w+', shape=(n_samples, clf.n_components))
for batch in gen_batches(n_samples, clf.batch_size_):  # gen_batches is in sklearn.utils
    X_batch_projected = clf.transform(X_train_mmap[batch])
    X_projected_mmap[batch] = X_batch_projected
I have not tested that code but I hope that you get the idea.

Trying to parallelize parameter search in scikit-learn leads to "SystemError: NULL result without error in PyObject_Call"

I'm using the sklearn.grid_search.RandomizedSearchCV class from scikit-learn 0.14.1, and I get an error when running the following code:
X, y = load_svmlight_file(inputfile)
min_max_scaler = preprocessing.MinMaxScaler()
X_scaled = min_max_scaler.fit_transform(X.toarray())
parameters = {'kernel':'rbf', 'C':scipy.stats.expon(scale=100), 'gamma':scipy.stats.expon(scale=.1)}
svr = svm.SVC()
classifier = grid_search.RandomizedSearchCV(svr, parameters, n_jobs=8)
classifier.fit(X_scaled, y)
When I set the n_jobs parameter to more than 1, I get the following error output:
Traceback (most recent call last):
File "./svm_training.py", line 185, in <module>
main(sys.argv[1:])
File "./svm_training.py", line 63, in main
gridsearch(inputfile, kerneltype, parameterfile)
File "./svm_training.py", line 85, in gridsearch
classifier.fit(X_scaled, y)
File "/usr/local/lib/python2.7/dist-packages/scikit_learn-0.14.1-py2.7-linux- x86_64.egg/sklearn/grid_search.py", line 860, in fit
return self._fit(X, y, sampled_params)
File "/usr/local/lib/python2.7/dist-packages/scikit_learn-0.14.1-py2.7-linux-x86_64.egg/sklearn/grid_search.py", line 493, in _fit
for parameters in parameter_iterable
File "/usr/local/lib/python2.7/dist-packages/scikit_learn-0.14.1-py2.7-linux-x86_64.egg/sklearn/externals/joblib/parallel.py", line 519, in __call__
self.retrieve()
File "/usr/local/lib/python2.7/dist-packages/scikit_learn-0.14.1-py2.7-linux-x86_64.egg/sklearn/externals/joblib/parallel.py", line 419, in retrieve
self._output.append(job.get())
File "/usr/lib/python2.7/multiprocessing/pool.py", line 558, in get
raise self._value
SystemError: NULL result without error in PyObject_Call
It seems to have something to do with Python's multiprocessing functionality, but I'm not sure how to work around it other than implementing the parallelization of the parameter search by hand. Has anyone had a similar issue when trying to parallelize the randomized parameter search, and were they able to solve it?
It turns out the problem was with the use of MinMaxScaler. Since MinMaxScaler only accepts dense arrays, I was converting the sparse representation of the feature vectors to a dense array before scaling. Since the feature vectors have thousands of elements, my assumption is that the dense arrays caused a memory error when trying to parallelize the parameter search. Instead, I switched to StandardScaler, which accepts sparse arrays as input and should be a better fit for my problem space anyway.
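A hedged sketch of that change (the file path is a placeholder, and the parameter grid is simplified); note that StandardScaler only accepts sparse input when centering is disabled:
import scipy.stats
from sklearn import preprocessing, svm
from sklearn.datasets import load_svmlight_file
from sklearn.model_selection import RandomizedSearchCV  # grid_search module in 0.14-era sklearn

X, y = load_svmlight_file("features.svmlight")  # placeholder path

# with_mean=False keeps the data sparse; centering would densify it
scaler = preprocessing.StandardScaler(with_mean=False)
X_scaled = scaler.fit_transform(X)

parameters = {'C': scipy.stats.expon(scale=100),
              'gamma': scipy.stats.expon(scale=.1)}
classifier = RandomizedSearchCV(svm.SVC(kernel='rbf'), parameters, n_jobs=8)
classifier.fit(X_scaled, y)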

SymPy cannot solve an equation that Matlab can

I have an equation which is related to the sun-synchronous resonance condition in orbital mechanics. I'm learning Python at the moment, so I attempted to solve it in SymPy using the following code:
from sympy import symbols,solve
[n_,Re_,p_,i_,J2_,Pe_] = symbols(['n_','Re_','p_','i_','J2_','Pe_'])
del_ss = -((3*n_*(Re_**2)*J2_/(4*(p_**2)))*(4-5*(sin(i_)**2)))-((3*n_*(Re_**2)*J2_/(2*(p_**2)))*cos(i_))-((2*pi)/Pe_)
pprint(solve(del_ss,i_))
The expression can be successfully rearranged for five of the variables, but when the variable i_ is used in the solve command (as above), an error is produced:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python27\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 479, in runfile
execfile(filename, namespace)
File "C:\Users\Nathan\Python\sympy_test_1.py", line 22, in <module>
pprint(solve(del_ss,i_))
File "C:\Python27\lib\site-packages\sympy\solvers\solvers.py", line 484, in solve
solution = _solve(f, *symbols, **flags)
File "C:\Python27\lib\site-packages\sympy\solvers\solvers.py", line 700, in _solve
soln = tsolve(f_num, symbol)
File "C:\Python27\lib\site-packages\sympy\solvers\solvers.py", line 1143, in tsolve
"(tsolve: at least one Function expected at this point")
NotImplementedError: Unable to solve the equation(tsolve: at least one Function expected at this point
However, when the same expression is entered into Matlab and the solve command is called, it is rearranged correctly. I realise that the error mentions a non-implemented feature and that the two functions will no doubt differ, but it would still be nice to know if there's a more appropriate SymPy function that I can use. Any help would be greatly appreciated.
Use the sympy version of Pi.
Substitute cos(i_) by a new variable ci_, replace sin(i_)**2 by 1-ci_**2, and solve for ci_.
This should do it:
from sympy import symbols, solve, sin, cos, pi, pprint
[n_, Re_, p_, ci_, J2_, Pe_] = symbols(['n_', 'Re_', 'p_', 'ci_', 'J2_', 'Pe_'])
del_ss = -((3*n_*(Re_**2)*J2_/(4*(p_**2)))*(4-5*(1-ci_**2))) - ((3*n_*(Re_**2)*J2_/(2*(p_**2)))*ci_) - ((2*pi)/Pe_)
pprint(solve(del_ss, ci_))
(Edited because I only wrote half of the solution in the first attempt...)
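As a small follow-up to the snippet above (my addition, not part of the original answer), you can map the ci_ solutions back to the inclination with acos, keeping in mind this picks only one branch of the inverse cosine:
from sympy import acos

ci_solutions = solve(del_ss, ci_)              # solutions for cos(i_)
i_solutions = [acos(s) for s in ci_solutions]  # one branch of i_ for each root
pprint(i_solutions)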
