I'm trying to solve the following problem:
import numpy as np
import pandas as pd
from scipy import sparse
X1 = sparse.rand(10, 10000)
df = pd.DataFrame({ 'a': range(10)})
In practice I get X1 from TfidfVectorizer, but I've left that code out for brevity.
I want to apply sparse.hstack to use both variables in a regression.
I convert the pandas Series to a numpy.ndarray as below:
X2 = df['a'].as_matrix()
type(X2)
numpy.ndarray
X = sparse.hstack((X1,X2))
ValueError Traceback (most recent call last)
<ipython-input-38-9493e3833c5d> in <module>()
----> 1 X = sparse.hstack((X1,X2))
C:\Program Files\Anaconda3\lib\site-packages\scipy\sparse\construct.py in hstack(blocks, format, dtype)
462
463 """
--> 464 return bmat([blocks], format=format, dtype=dtype)
465
466
C:\Program Files\Anaconda3\lib\site-packages\scipy\sparse\construct.py in bmat(blocks, format, dtype)
579 elif brow_lengths[i] != A.shape[0]:
580 raise ValueError('blocks[%d,:] has incompatible '
--> 581 'row dimensions' % i)
582
583 if bcol_lengths[j] == 0:
ValueError: blocks[0,:] has incompatible row dimensions
What's wrong?
I've done the following, and it works:
import numpy as np
import pandas as pd
from scipy import sparse
X1 = sparse.rand(10, 10000)
df = pd.DataFrame({ 'a': range(10)})
X2 = df['a'].reset_index()   # produces a two-column DataFrame (index, a)
X2 = X2.iloc[:, [1]].values  # [[1]] keeps the result 2-D: shape (10, 1)
X = sparse.hstack((X1,X2))
Your arrays must have the same first dimension, and each block must be two-dimensional (at least one row and one column).
You can check that with X1.shape and X2.shape: X1 is (10, 10000), but df['a'].values is one-dimensional with shape (10,), so sparse.hstack interprets it as a single row of length 10 and the row counts no longer match.
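A minimal sketch of an equivalent fix without the reset_index detour (same setup as above): reshape the 1-D values into a column before stacking.
import numpy as np
import pandas as pd
from scipy import sparse

X1 = sparse.rand(10, 10000)
df = pd.DataFrame({'a': range(10)})

# df['a'].values has shape (10,), which sparse.hstack reads as one row.
# Reshaping to (10, 1) gives both blocks the same number of rows.
X2 = df['a'].values.reshape(-1, 1)
X = sparse.hstack((X1, X2))
print(X.shape)  # (10, 10001)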
I am trying to run PCA (Principal component analysis) on GPU. I am using skcuda.linalg.PCA for that purpose, but it's not working. From their tutorial (https://scikit-cuda.readthedocs.io/en/latest/generated/skcuda.linalg.PCA.html):
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
import numpy as np
import skcuda.linalg as linalg
from skcuda.linalg import PCA as cuPCA
pca = cuPCA(n_components=4) # map the data to 4 dimensions
X = np.random.rand(1000,100) # 1000 samples of 100-dimensional data vectors
X_gpu = gpuarray.GPUArray((1000,100), np.float64, order="F") # note that order="F" or a transpose is necessary. fit_transform requires row-major matrices, and column-major is the default
X_gpu.set(X) # copy data to gpu
T_gpu = pca.fit_transform(X_gpu) # calculate the principal components
When I run it I get the following error:
cublasInternalError Traceback (most recent call last)
<ipython-input-31-02aaf0fa19e4> in <module>
8 X_gpu = gpuarray.GPUArray((1000,100), np.float64, order="F") # note that order="F" or a transpose is necessary. fit_transform requires row-major matrices, and column-major is the default
9 X_gpu.set(X) # copy data to gpu
---> 10 T_gpu = pca.fit_transform(X_gpu)
/opt/conda/lib/python3.7/site-packages/skcuda/linalg.py in fit_transform(self, X_gpu)
204 cuGemv (self.h, 'n', p, k, -1.0, P_gpu.gpudata, p, U_gpu.gpudata, 1, 1.0, P_gpu[:,k].gpudata, 1)
205
--> 206 l2 = cuNrm2(self.h, p, P_gpu[:,k].gpudata, 1)
207 cuScal(self.h, p, 1.0/l2, P_gpu[:,k].gpudata, 1)
208 cuGemv(self.h, 'n', n, p, 1.0, R_gpu.gpudata, n, P_gpu[:,k].gpudata, 1, 0.0, T_gpu[:,k].gpudata, 1)
/opt/conda/lib/python3.7/site-packages/skcuda/cublas.py in cublasDnrm2(handle, n, x, incx)
1295 n, int(x), incx,
1296 ctypes.byref(result))
-> 1297 cublasCheckStatus(status)
1298 return np.float64(result.value)
1299
/opt/conda/lib/python3.7/site-packages/skcuda/cublas.py in cublasCheckStatus(status)
177 raise cublasError
178 else:
--> 179 raise e
180
181 # Helper functions:
cublasInternalError
Initially I ran this on my own data and got this error. Then I ran the tutorial example and got exactly the same error. Does anyone know what the problem is here? I am using a Kaggle notebook with a Tesla T4 GPU. Thanks.
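As a sanity check only (a sketch, and not a fix for the cuBLAS error), the same decomposition can be run on the CPU with scikit-learn to confirm the data and the expected output shape before digging into the GPU path:
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(1000, 100)  # same shape as the tutorial data
T = PCA(n_components=4).fit_transform(X)
print(T.shape)  # (1000, 4), the shape cuPCA's fit_transform should also produce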
Here is the code:
import numpy as np
import os
import cv2
import matplotlib.pyplot as plt
from os import listdir
from pathlib import Path
jpeg_images = list(Path(r'D:/ncfm/train').glob('**/*.jpg'))
np.array([np.array(cv2.imread(str(file))).flatten() for file in jpeg_images])
folder = ['ALB', 'BET', 'DOL', 'LAG', 'NoF', 'OTHER', 'SHARK', 'YFT', 'test images']
path = r'D:\ncfm\train'  # lowercase to avoid shadowing pathlib.Path imported above
L = []                   # per-folder image counts
for i in range(9):
    listing = os.listdir(path + '/' + folder[i])
    folder[i] = np.array([np.array(cv2.imread(path + '/' + folder[i] + '/' + file)).flatten() for file in listing])
    L.append(len(listing))
Next, I tried to concatenate these:
M = np.concatenate((folder[1], folder[2], folder[3], folder[4],
                    folder[5], folder[6], folder[7], folder[8]))
Next, I did the labelling:
label = np.ones((3777), dtype=int)
label[0:1720]=1
label[1720:1920]=2
label[1920:2038]=3
label[2038:2104]=4
label[2104:2568]=5
label[2568:2868]=6
label[2868:3044]=7
label[3044:3777]=8
from sklearn.utils import shuffle
data,Label = shuffle(M, label, random_state = 2)
Here is the error:
ValueError Traceback (most recent call last)
<ipython-input-148-f7cec68b48c6> in <module>
1 from sklearn.utils import shuffle
2
----> 3 data,Label = shuffle(M, label, random_state = 2)
~\Anaconda3\lib\site-packages\sklearn\utils\__init__.py in shuffle(*arrays, **options)
447 """
448 options['replace'] = False
--> 449 return resample(*arrays, **options)
450
451
~\Anaconda3\lib\site-packages\sklearn\utils\__init__.py in resample(*arrays, **options)
330 n_samples))
331
--> 332 check_consistent_length(*arrays)
333
334 if stratify is None:
~\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_consistent_length(*arrays)
203 if len(uniques) > 1:
204 raise ValueError("Found input variables with inconsistent numbers of"
--> 205 " samples: %r" % [int(l) for l in lengths])
206
207
ValueError: Found input variables with inconsistent numbers of samples: [3058, 3777]
At first I got the lengths as [8, 3777]. After converting RGB to grayscale and resizing, I got the lengths as [3058, 3777]. I want to shuffle the rows of the matrix M and the rows of label simultaneously.
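One hedged sketch of a way to keep label consistent with M: build the label vector from the per-folder counts collected in L instead of hard-coding slice boundaries. The counts below are inferred from the question's slices and are assumptions.
import numpy as np

# Hypothetical per-class counts; in the question these come from
# L.append(len(listing)) inside the folder loop.
L = [1720, 200, 118, 66, 464, 300, 176, 733]

# Deriving labels from the counts keeps len(label) equal to the number of
# rows in M even when images are dropped or folders change.
label = np.concatenate([np.full(n, i + 1, dtype=int) for i, n in enumerate(L)])
print(label.shape)  # (3777,) with these counts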
Can someone help me figure out why I'm getting this error: ValueError: n_components must be < n_features; got 10 >= 0
import pandas as pd
from scipy.sparse import csr_matrix
users = pd.read_table(open('ml-1m/users.dat', encoding = "ISO-8859-1"), sep=':', header=None, names=['user_id', 'gender', 'age', 'occupation', 'zip'])
ratings = pd.read_table(open('ml-1m/ratings.dat', encoding = "ISO-8859-1"), sep=':', header=None, names=['user_id', 'movie_id', 'rating', 'timestamp'])
movies = pd.read_table(open('ml-1m/movies.dat', encoding = "ISO-8859-1"), sep=':', header=None, names=['movie_id', 'title', 'genres'])
MovieLens = pd.merge(pd.merge(ratings, users), movies)
ratings_mtx_df = MovieLens.pivot_table(values='rating', index='user_id', columns='title', fill_value=0)
movie_index = ratings_mtx_df.columns
from sklearn.decomposition import TruncatedSVD
recom = TruncatedSVD(n_components=10, random_state=101)
R = recom.fit_transform(ratings_mtx_df.values.T)
ValueError Traceback (most recent call last)
<ipython-input-8-0bd6c9bda95a> in <module>()
1 from sklearn.decomposition import TruncatedSVD
2 recom = TruncatedSVD(n_components=10, random_state=101)
----> 3 R = recom.fit_transform(ratings_mtx_df.values.T)
C:\Users\renau\Anaconda3\lib\site-packages\sklearn\decomposition\truncated_svd.py in fit_transform(self, X, y)
168 if k >= n_features:
169 raise ValueError("n_components must be < n_features;"
--> 170 " got %d >= %d" % (k, n_features))
171 U, Sigma, VT = randomized_svd(X, self.n_components,
172 n_iter=self.n_iter,
ValueError: n_components must be < n_features; got 10 >= 0
You're trying to reduce your data to 10 dimensions, but as per the documentation for TruncatedSVD, the number of features (columns) in your ratings_mtx_df data needs to be strictly greater than the number of components you're looking to extract. Try n_components=3 (assuming you've got more than 3 features in your data) and see if that's any better.
Also, you're turning your input data sideways with the .T (transpose) in:
R = recom.fit_transform(ratings_mtx_df.values.T)
That swaps features (columns) for observations (rows), which might explain why the fit_transform method isn't working; the error message shows the matrix it received ended up with 0 features.
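To illustrate the constraint on its own (a minimal sketch with synthetic data, not the MovieLens set):
import numpy as np
from sklearn.decomposition import TruncatedSVD

X = np.random.rand(100, 20)  # 100 samples, 20 features
R = TruncatedSVD(n_components=10, random_state=101).fit_transform(X)  # fine: 10 < 20
print(R.shape)  # (100, 10)
# TruncatedSVD(n_components=20).fit_transform(X) raises the same ValueError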
The lines in question are:
# Make efficient matrix that can be built
K = sparse.lil_matrix((N, N))
# Calculate K matrix (<i|pHp|j> in the LGL-nodes basis)
for i in range(Ne):
idx_s, idx_e = i*(Np-1), i*(Np-1)+Np
print(shape(K[idx_s:idx_e, idx_s:idx_e]))
print(shape(dmat.T.dot(sparse.spdiags(w*peq[idx_s:idx_e], 0, Np, Np)).dot(dmat)))
K[idx_s:idx_e, idx_s:idx_e] += dmat.T.dot(sparse.spdiags(w*peq[idx_s:idx_e], 0, Np, Np)).dot(dmat)
But NumPy raises the following error:
(8, 8)
(8, 8)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-62-cc7cc21f07e5> in <module>()
22
23 for _ in range(N):
---> 24 ll, q = getLL(Ne, Np, x_d, w_d, dmat_d, x, w, dL, peq*peq, data)
25 peq = (peq*q)
26
<ipython-input-61-a52c13d48b87> in getLL(Ne, Np, x_d, w_d, dmat_d, x, w, dmat, peq, data)
15 print(shape(K[idx_s:idx_e, idx_s:idx_e]))
16 print(shape(dmat.T.dot(sparse.spdiags(w*peq[idx_s:idx_e], 0, Np, Np)).dot(dmat)))
---> 17 K[idx_s:idx_e, idx_s:idx_e] += dmat.T.dot(sparse.spdiags(w*peq[idx_s:idx_e], 0, Np, Np)).dot(dmat)
18
19 # Re-make matrix for efficient vector products
/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/scipy/sparse/lil.py in __iadd__(self, other)
157
158 def __iadd__(self,other):
--> 159 self[:,:] = self + other
160 return self
161
/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/scipy/sparse/lil.py in __setitem__(self, index, x)
307
308 # Make x and i into the same shape
--> 309 x = np.asarray(x, dtype=self.dtype)
310 x, _ = np.broadcast_arrays(x, i)
311
/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/numpy/core/numeric.py in asarray(a, dtype, order)
460
461 """
--> 462 return array(a, dtype, copy=False, order=order)
463
464 def asanyarray(a, dtype=None, order=None):
ValueError: setting an array element with a sequence.
This is a little cryptic, as the error seems to come from somewhere inside the NumPy library, not from my code. But I'm not terribly familiar with NumPy, per se, so perhaps I'm indirectly causing it.
Both slices have the same shape, so that doesn't seem to be the actual problem.
The problem is that
dmat.T.dot(sparse.spdiags(w*peq[idx_s:idx_e], 0, Np, Np)).dot(dmat)
is not a simple array. It has the right shape, but its elements are sparse matrices (the 'sequence' in the error message).
Turning the inner sparse matrix into a dense array should solve the problem:
dmat.T.dot(sparse.spdiags(w*peq[idx_s:idx_e], 0, Np, Np).A).dot(dmat)
The np.dot method is not aware of sparse matrices, at least not in your version of NumPy (1.8?), so it treats them as sequences. Newer versions are sparse-aware.
Another solution is to use the sparse matrix product (dot or *).
sparse.spdiags(...).dot(dmat etc)
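A small self-contained sketch of the dense-array fix, with made-up sizes standing in for the question's N, Ne, Np, w, peq and dmat:
import numpy as np
from scipy import sparse

# Hypothetical small sizes; the question's real values are not shown.
Np, Ne = 8, 3
N = Ne * (Np - 1) + 1
dmat = np.random.rand(Np, Np)
w = np.random.rand(Np)
peq = np.random.rand(N)
K = sparse.lil_matrix((N, N))

for i in range(Ne):
    idx_s, idx_e = i * (Np - 1), i * (Np - 1) + Np
    D = sparse.spdiags(w * peq[idx_s:idx_e], 0, Np, Np)
    # Densify the diagonal (.A) so np.dot sees plain ndarrays and the block
    # is a (Np, Np) numeric array, not an object array of sparse matrices.
    K[idx_s:idx_e, idx_s:idx_e] += dmat.T.dot(D.A).dot(dmat)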
I had to play around to get reasonable values for N, Np, Ne, dmat and peq. You really should have given us small samples; it makes testing ideas much easier.
I have numeric data stored in two DataFrames, x and y. The inner product from NumPy works, but the dot product from pandas does not.
In [63]: x.shape
Out[63]: (1062, 36)
In [64]: y.shape
Out[64]: (36, 36)
In [65]: np.inner(x, y).shape
Out[65]: (1062L, 36L)
In [66]: x.dot(y)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-66-76c015be254b> in <module>()
----> 1 x.dot(y)
C:\Programs\WinPython-64bit-2.7.3.3\python-2.7.3.amd64\lib\site-packages\pandas\core\frame.pyc in dot(self, other)
888 if (len(common) > len(self.columns) or
889 len(common) > len(other.index)):
--> 890 raise ValueError('matrices are not aligned')
891
892 left = self.reindex(columns=common, copy=False)
ValueError: matrices are not aligned
Is this a bug or am I using pandas wrong?
Not only must the shapes of x and y be correct, but also
the column names of x must match the index names of y. Otherwise
this code in pandas/core/frame.py will raise a ValueError:
if isinstance(other, (Series, DataFrame)):
    common = self.columns.union(other.index)
    if (len(common) > len(self.columns) or
            len(common) > len(other.index)):
        raise ValueError('matrices are not aligned')
If you just want to compute the matrix product without making the column names of x match the index names of y, then use the NumPy dot function:
np.dot(x, y)
The reason why the column names of x must match the index names of y is because the pandas dot method will reindex x and y so that if the column order of x and the index order of y do not naturally match, they will be made to match before the matrix product is performed:
left = self.reindex(columns=common, copy=False)
right = other.reindex(index=common, copy=False)
The NumPy dot function does no such thing. It will just compute the matrix product based on the values in the underlying arrays.
Here is an example which reproduces the error:
import pandas as pd
import numpy as np
columns = ['col{}'.format(i) for i in range(36)]
x = pd.DataFrame(np.random.random((1062, 36)), columns=columns)
y = pd.DataFrame(np.random.random((36, 36)))
print(np.dot(x, y).shape)
# (1062, 36)
print(x.dot(y).shape)
# ValueError: matrices are not aligned
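And if you do want the pandas method, one sketch (assuming the positional order of x's columns already corresponds to the order of y's rows) is to relabel x's columns with y's index before calling .dot:
x.columns = y.index  # relabel so pandas sees the two as aligned
print(x.dot(y).shape)
# (1062, 36)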