I'm working with Python, sklearn and NumPy, and I am creating the following sparse matrix:
feats = tfidf_vect.fit_transform(np.asarray(tweets))
print(feats)
feats=np.log(np.asarray(feats))
but I am getting the following error when I apply the log:
Traceback (most recent call last):
File "src/ef_tfidf.py", line 100, in <module>
feats=np.log(np.asarray(feats))
AttributeError: log
The error is related to the fact that feats is a sparse matrix. I would appreciate any help with this, i.e. a way to apply the log to a sparse matrix.
The correct way to convert a sparse matrix to an ndarray is with the toarray method:
feats = np.log(feats.toarray())
np.asarray (and np.array) don't understand sparse matrix inputs; they wrap the matrix in an object array instead of converting it, which is why np.log then fails with AttributeError: log.
If you want to only take the log of non-zero entries and return a sparse matrix of results, the best way would probably be to take the logarithm of the matrix's data and build a new sparse matrix with that data.
How that works through the public interface is different for different sparse matrix types; you'd want to look up the constructor for whatever type you have. Alternatively, there's the private _with_data method:
feats = feats._with_data(np.log(feats.data), copy=True)
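For example, with a CSR matrix (the format TfidfVectorizer produces), the public constructor accepts the data, indices and indptr arrays directly. A minimal sketch, assuming feats is the CSR matrix from the question:
import numpy as np
from scipy.sparse import csr_matrix

# Reuse the existing structure (indices/indptr) and only transform the stored values
feats_log = csr_matrix((np.log(feats.data), feats.indices, feats.indptr),
                       shape=feats.shape)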
I actually needed to take something like log(p+1) for some sparse matrix p, and I found that scipy's log1p method returns exactly that for a sparse matrix. I don't have enough reputation to comment, so I'm just putting this here in case it helps anyone.
In principle you could apply this to the original question with
feats = (feats-1).log1p()
but note that scipy does not support subtracting a nonzero scalar from a sparse matrix (it would make every zero entry nonzero), so log1p is most useful when log(1 + p) is what you actually want. It has the advantage of keeping feats sparse.
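A minimal sketch with a toy matrix (the values are made up) just to show that log1p keeps the result sparse:
import numpy as np
from scipy.sparse import csr_matrix

p = csr_matrix(np.array([[0.0, 2.0], [3.0, 0.0]]))  # stand-in for the tf-idf matrix

logp1 = p.log1p()      # elementwise log(1 + x), applied only to the stored values
print(type(logp1))     # still a scipy sparse matrix
print(logp1.toarray())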
fit_transform() returns a scipy sparse matrix (CSR format for TfidfVectorizer), which has a data attribute holding the non-zero values of the sparse matrix.
You can use the data attribute to manipulate the non-zero data of the sparse matrix directly, as follows:
feats.data = np.log(feats.data)
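A self-contained sketch of that pattern (the tiny corpus below is made up, standing in for tweets):
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

tweets = ["good morning", "good night", "morning coffee"]  # hypothetical data
tfidf_vect = TfidfVectorizer()
feats = tfidf_vect.fit_transform(tweets)   # sparse matrix in CSR format

feats.data = np.log(feats.data)            # log of the non-zero entries only
print(feats.toarray())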
I am looking at this example
https://www.analyticsvidhya.com/blog/2019/04/predicting-movie-genres-nlp-multi-label-classification/
specifically at the lines where TF-IDF is used:
# create TF-IDF features
xtrain_tfidf = tfidf_vectorizer.fit_transform(xtrain)
xval_tfidf = tfidf_vectorizer.transform(xval)
When I try to view the results of xtrain_tfidf, I get this message:
xtrain_tfidf
Out[69]:
<33434x10000 sparse matrix of type '<class 'numpy.float64'>'
with 3494870 stored elements in Compressed Sparse Row format>
I would like to see what xtrain_tfidf contains.
How can I view it?
Jupyter (or rather IPython (or rather the Python REPL)) implicitly calls xtrain_tfidf.__repr__() when you evaluate the name of the variable. Using print calls xtrain_tfidf.__str__(), which is what you're looking for when you want to see the nonzero values in a sparse matrix:
print(xtrain_tfidf)
If you want to print everything including zero-values, slowness and possible out-of-memory be darned, then try
import numpy as np
with np.printoptions(threshold=np.inf):
    print(xtrain_tfidf.toarray())
xtrain,xtest,ytrain,ytest = train_test_split(df_train['clean_comments'],df_train['label'].values,test_size=0.3,shuffle = True)
vectorizer = TfidfVectorizer(strip_accents='unicode',analyzer='word',ngram_range=(1,3),norm='l2')
vectorizer.fit(xtrain)
x_train = vectorizer.transform(xtrain)
x_train = x_train.toarray()
I am trying to convert a sparse matrix to a dense array using the toarray() method, but it raises a memory error.
I've already tried the todense() method, but it didn't work either.
Sparse matrices store in memory only the values that are different from zero and are therefore very well suited to bag-of-words matrices. If you convert a sparse matrix to a dense format, it consumes a lot more memory since all the zeros are stored explicitly as well. If you don't have enough memory, this raises an out-of-memory error.
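A rough way to see the difference, with made-up dimensions in the same ballpark as a bag-of-words matrix:
import numpy as np
from scipy.sparse import random as sparse_random

X = sparse_random(10000, 5000, density=0.01, format='csr', dtype=np.float64)

sparse_bytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes
dense_bytes = X.shape[0] * X.shape[1] * X.dtype.itemsize  # 8 bytes per float64

print("sparse storage: about %.1f MB" % (sparse_bytes / 1e6))
print("dense storage:  about %.1f MB" % (dense_bytes / 1e6))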
I have a matrix and I want to check if it is sparse or not.
Things I have tried:
isinstance method:
if isinstance(<matrix>, scipy.sparse.csc.csc_matrix):
This works fine if I know exactly which sparse class I want to check.
getformat method: but it assumes that my matrix is already sparse and just returns its format
But I want a way to know whether a matrix is sparse or not, one that works irrespective of which sparse class it is.
Kindly help me.
scipy.sparse.issparse(my_matrix)
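This returns True for any of the scipy sparse formats and False otherwise, e.g.:
import numpy as np
from scipy import sparse

dense = np.eye(3)
sp = sparse.csr_matrix(dense)

print(sparse.issparse(dense))  # False
print(sparse.issparse(sp))     # True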
You can do sparsity = 1.0 - count_nonzero(X) / X.size
This works for any dense matrix and measures how sparse the values are, rather than checking whether the object is a scipy sparse type.
I have a huge sparse matrix. I would like to save the dense equivalent to the file system.
The problem is the memory limit on my machine.
My original idea is:
1. convert huge_sparse_matrix to an ndarray with np.asarray(huge_sparse_matrix)
2. assign values
3. save it back to the file system
However, at step 1, Python raises MemoryError.
One possible approach in my mind is:
1. create a chunk of the dense array
2. assign values from the corresponding sparse one
3. save the dense array chunk back to the file system
4. repeat 1-3
But how to do that?
You can use the scipy.sparse module to read the sparse matrix and then convert it to NumPy; see the documentation here: scipy.sparse docs and examples.
I think np.asarray() is not really the function you're looking for.
You might try the SciPy matrix format coo_matrix() (coordinate-format matrix).
scipy.sparse.coo_matrix
This format allows you to store huge sparse matrices using very little memory.
Furthermore, many of SciPy's mathematical functions also work with this matrix format.
The matrix representation in this format basically consists of three lists (a small example follows):
row: the index of the row
col: the index of the column
data: the value at this position
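For instance, a small matrix built from those three lists (the numbers are arbitrary):
import numpy as np
from scipy.sparse import coo_matrix

row  = np.array([0, 1, 3])        # row indices of the non-zero entries
col  = np.array([2, 0, 1])        # column indices
data = np.array([4.0, 5.0, 7.0])  # the values at those positions

m = coo_matrix((data, (row, col)), shape=(4, 3))
print(m.toarray())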
hope that helped, cheers
The common and most straightforward answer to memory problems is: Do not create objects, use an iterator or a generator.
If I understand correctly, you have a sparse matrix and you want to transform it into a list representation. Here's some sample code:
def iter_sparse_matrix(m, d1, d2):
    # Yield (row, col, value) triples for the non-zero entries of m
    for i in range(d1):
        for j in range(d2):
            if m[i, j]:
                yield (i, j, m[i, j])

dense_array = list(iter_sparse_matrix(m, d1, d2))
You might also want to look here:
http://cvxopt.org/userguide/matrices.html#sparse-matrices
If I'm not wrong, the problem is that the dense version of the sparse matrix does not fit in your memory, and thus you are not able to save it.
What I would suggest is to use HDF5. HDF5 keeps big data on disk and passes it to memory only when needed.
Something like this should work:
import h5py
data = # your sparse matrix
cx = data.tocoo() # coo sparse representation
This will create your data matrix (full of zeros) on disk:
f = h5py.File('dset.h5','w')
dataset = f.create_dataset("data", data.shape)
Fill the matrix with the sparse data:
# h5py doesn't support numpy-style fancy indexing with two index arrays, so write the points one at a time
for i, j, v in zip(cx.row, cx.col, cx.data):
    dataset[i, j] = v
Add any modifications you want to dataset:
dataset[something, something] = something
And finally, save it:
f.close()
The way HDF5 works is, I think, perfect for your needs. The matrix is always stored on disk, so it doesn't require memory; however, you can operate on it as if it were a standard numpy matrix (indexing, slicing, np.* operations and so on), and the h5py driver will bring into memory only the parts of the matrix that you need (never the whole matrix unless you specifically request it with something like data[:, :]).
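For example, you can later read back just a block of the dataset created above (a sketch, reopening the same file):
import h5py

with h5py.File('dset.h5', 'r') as f:
    block = f['data'][0:1000, 0:1000]   # only this slice is loaded into memory
    print(block.shape, block.dtype)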
PS: I'm assuming your sparse matrix is one of scipy's sparse matrix types. If not, replace cx.row, cx.col and cx.data with the equivalents provided by your matrix representation (it should be something similar).
I'm a newbie to Python, and I'm trying to write the data in a matrix to a CSV file. The variable is defined as:
(Pdb) trainFeatures
<1562936x312116 sparse matrix of type '<type 'numpy.float64'>'
with 43753231 stored elements in Compressed Sparse Row format>
I have a line of code:
numpy.savetxt("feature_train.csv", trainFeatures, delimiter=',')
When I run that line, I get an error message:
ncol = X.shape[1]
IndexError: tuple index out of range
I'm sure the matrix is somehow not in the right format, but I don't know how to get it into one that works. Can anyone point out what I need to do here?
OK, to complete the process: the answer to the original question is to use the todense() method to turn trainFeatures into a format that savetxt() recognizes. But to address the lack of memory, the obvious solution is to use the getrow() method, iterate through all the rows, and write each row to the file individually, rather than trying to do the whole matrix in one go.
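A minimal sketch of that row-by-row idea, assuming trainFeatures is the scipy CSR matrix from the question (the output file name is just the one used above):
with open("feature_train.csv", "w") as out:
    for i in range(trainFeatures.shape[0]):
        row = trainFeatures.getrow(i).toarray().ravel()   # one dense row at a time
        out.write(",".join(str(v) for v in row) + "\n")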