I am looking at this example:
https://www.analyticsvidhya.com/blog/2019/04/predicting-movie-genres-nlp-multi-label-classification/
specifically at the lines where TF-IDF is used:
# create TF-IDF features
xtrain_tfidf = tfidf_vectorizer.fit_transform(xtrain)
xval_tfidf = tfidf_vectorizer.transform(xval)
When I try to view the results of xtrain_tfidf, I get this message:
xtrain_tfidf
Out[69]:
<33434x10000 sparse matrix of type '<class 'numpy.float64'>'
with 3494870 stored elements in Compressed Sparse Row format>
I would like to see what xtrain_tfidf contains. How can I view it?
Jupyter (or rather IPython (or rather the Python REPL)) implicitly calls xtrain_tfidf.__repr__() when you evaluate the name of the variable. Using print calls xtrain_tfidf.__str__(), which is what you're looking for when you want to see the nonzero values in a sparse matrix:
print(xtrain_tfidf)
If you want to print everything including zero-values, slowness and possible out-of-memory be darned, then try
import numpy as np
with np.printoptions(threshold=np.inf):
    print(xtrain_tfidf.toarray())
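If you also want column labels, here is a minimal sketch assuming tfidf_vectorizer is a fitted scikit-learn TfidfVectorizer (get_feature_names_out requires scikit-learn 1.0+; older versions use get_feature_names):

import pandas as pd

# Densify and label the columns with the vocabulary terms; only
# feasible if the dense matrix fits in memory
df = pd.DataFrame(xtrain_tfidf.toarray(),
                  columns=tfidf_vectorizer.get_feature_names_out())
print(df.head())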
I have built a Python program that processes the probability of various datasets. I input various mean values and standard deviations manually, and that works, but I need to automate it so that I can upload all my data through a text or CSV file. I've got so far, but now I have a nested for-loop problem, I think with indices; some background follows...
My code works for a small dataset where I can manually key in 6-8 parameters, but now I need to automate it and upload inputs of unknown sizes by CSV / text file. I am copying my existing code and amending it where appropriate, but I have run into a problem.
I have a 2-D NumPy array in which some probabilities have been reverse sorted. I have a second array which gives me 68.3% of each row's total, and I want to trim away the low-value 31.7% of the data.
I need a solution which can handle an unspecified number of rows.
My pre-existing code, which worked for a single one-dimensional array, was:
prob_combine_sum = np.sum(prob_combine)

# Reverse sort the probabilities
prob_combine_sorted = sorted(prob_combine, reverse=True)

# Calculate 1 SD from peak prob by multiplying total prob by 68.3%
sixty_eight_percent = prob_combine_sum * 0.68269

# Loop over the sorted list and append the 1 SD data into the list
# onesd_prob_combine
onesd_prob_combine = []
for i in prob_combine_sorted:
    onesd_prob_combine.append(i)
    if sum(onesd_prob_combine) > sixty_eight_percent:
        break
That worked. However, now I have a multi-dimensional array, and I want to take the 1 standard deviation data from that multi-dimensional array and stick it in another.
There's probably more than one way of doing this, but I thought I would stick with the for loop; now, though, it's more complicated because of the indices. I need to preserve the data structure, and I need to be able to handle unlimited numbers of rows in the future.
I simulated some data, and if I can get this working with it, I should be able to put it in my program.
sorted_probabilities = np.asarray([[9, 8, 7, 6, 5, 4, 3, 2, 1],
                                   [87, 67, 54, 43, 32, 22, 16, 14, 2],
                                   [100, 99, 78, 65, 45, 43, 39, 22, 3],
                                   [67, 64, 49, 45, 42, 40, 28, 23, 17]])
sd_test = np.asarray([30.7215, 230.0699, 306.5323, 256.0125])
target_array = np.zeros(4).reshape(4, 1)
# Task: transfer data from sorted_probabilities to target_array on
# the condition that the value in each target row is less than the
# value in the sd_test array.
# Ignore the problem that the transferred data won't add up to 68.3%.
# My real data sample is very big. I just need a way of trimming
# and transferring.
for row in sorted_probabilities:
    for element in row:
        target_array[row].append[i]
        if sum(target[row]) > sd_test[row]:
            break
Error: IndexError: index 9 is out of bounds for axis 0 with size 4
I know it's not a very good attempt. My problem is that I need a solution which will work for any 2D array, not just one with 4 rows.
I'd be really grateful for any help.
Thank you
Edit:
Can someone help me out with this? I am struggling.
I think the reason my loop will not work is that the 'index' row I am using is not a number but, in this case, a row. I will have a think about this. In the meantime, does anyone have a solution?
Thanks
I tried the following code after reading the comments:
for counter, value in enumerate(sorted_probabilities):
    for i, element in enumerate(value):
        target_array[counter] = sorted_probabilities[counter][element]
        if target_array[counter] > sd_test[counter]:
            break
I get an error: IndexError: index 9 is out of bounds for axis 0 with size 9
I think it's because I am trying to add to a NumPy array of pre-determined dimensions. I am not sure. I am going to try another tack now, as I cannot do this with this approach. It's having to maintain the rows in the target array that makes it difficult. Each row relates to an object, and if I lose the structure it will be pointless.
I recommend you use pandas. You can read the CSV directly into a DataFrame and do multiple operations on columns and such, clean and neat.
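For example, a minimal sketch (the file name and column names here are hypothetical; adjust them to your actual CSV layout):

import pandas as pd

# Hypothetical file and column names
df = pd.read_csv('parameters.csv')   # one row per dataset
means = df['mean'].to_numpy()
std_devs = df['std_dev'].to_numpy()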
You are mixing NumPy arrays with Python lists. Better to use only one of these (NumPy is preferred). Also, try to debug your code, because it has both syntax and logical errors: you don't have a variable i, yet you use it as an index; and you are using row as an index while it is a NumPy array, not an integer.
I strongly recommend that you:
0) debug your code (at least with prints);
1) use enumerate to create both of your for loops;
2) replace append with plain assignment, because you've already created an empty vector (target_array), or initialize target_array as an empty list and append into it;
3) if you want to use your solution for any 2D array, wrap your code in a function.
Try this:
sorted_probabilities = np.asarray([[9, 8, 7, 6, 5, 4, 3, 2, 1],
                                   [87, 67, 54, 43, 32, 22, 16, 14, 2],
                                   [100, 99, 78, 65, 45, 43, 39, 22, 3],
                                   [67, 64, 49, 45, 42, 40, 28, 23, 17]])
sd_test = np.asarray([30.7215, 230.0699, 306.5323, 256.0125])
target_array = np.zeros(4).reshape(4, 1)

for counter, value in enumerate(sorted_probabilities):
    for i, element in enumerate(value):
        target_array[counter] = element  # Here I removed the code that produced the error
        if target_array[counter] > sd_test[counter]:
            break
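If it helps, here is also a minimal sketch of the actual trimming the question describes (keep each row's values until the running sum first exceeds that row's threshold), using np.cumsum and np.searchsorted. Since rows may keep different numbers of elements, the result is a Python list of arrays rather than a rectangular ndarray:

import numpy as np

sorted_probabilities = np.asarray([[9, 8, 7, 6, 5, 4, 3, 2, 1],
                                   [87, 67, 54, 43, 32, 22, 16, 14, 2],
                                   [100, 99, 78, 65, 45, 43, 39, 22, 3],
                                   [67, 64, 49, 45, 42, 40, 28, 23, 17]])
sd_test = np.asarray([30.7215, 230.0699, 306.5323, 256.0125])

trimmed = []
for row, threshold in zip(sorted_probabilities, sd_test):
    cumulative = np.cumsum(row)
    # first index where the running sum exceeds the threshold;
    # keep everything up to and including that element
    cutoff = int(np.searchsorted(cumulative, threshold)) + 1
    trimmed.append(row[:cutoff])

for row in trimmed:
    print(row, row.sum())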
I am using MLkNN from the skmultilearn.adapt library for one of my classification problems. The output which the predict function gives me is a sparse matrix of type int.
mlk=mlknn.MLkNN(k=10)
mlk.fit(training_M,Y_train)
output=mlk.predict(testing_M)
When I try to print the output like
print(output)
it shows me only one entry, i.e.
(0, 1120) 1
But I need to read the full matrix and find the non-zero values.
If I do
output[2][4]
it shows me a "Row index out of bounds" error.
How can I avoid this error and get the row and column indices of all the non-zero values?
This print is a condensed form and means that there is only one non-zero value in that matrix; otherwise there would be more output.
You can double-check this by looking at output.nnz (an attribute, not a function).
If you have enough memory, you can use output.todense() to obtain classic non-sparse NumPy arrays.
Otherwise, look up the docs to see how to work with sparse matrices more efficiently:
scipy sparse docs
Remark: your example output[2][4] shows that you are new to numpy/scipy, and I highly recommend going through their docs. Indexing 2D arrays / matrices is done like output[2,4]
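To directly answer the question of getting the row and column indices of all non-zero values, a minimal sketch (assuming output is a SciPy sparse matrix, as returned by MLkNN.predict):

from scipy.sparse import find

# find() returns the row indices, column indices, and values of all
# non-zero entries in one call
rows, cols, values = find(output)
for r, c, v in zip(rows, cols, values):
    print(r, c, v)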
I'm working with Python, sklearn and numpy, and I am creating the following sparse matrix:
feats = tfidf_vect.fit_transform(np.asarray(tweets))
print(feats)
feats=np.log(np.asarray(feats))
but I am getting the following error when I apply the log:
Traceback (most recent call last):
File "src/ef_tfidf.py", line 100, in <module>
feats=np.log(np.asarray(feats))
AttributeError: log
The error is related to the fact that feats is a sparse matrix. I would appreciate any help with this, i.e. a way to apply the log to a sparse matrix.
The correct way to convert a sparse matrix to an ndarray is with the toarray method:
feats = np.log(feats.toarray())
np.array doesn't understand sparse matrix inputs.
If you want to only take the log of non-zero entries and return a sparse matrix of results, the best way would probably be to take the logarithm of the matrix's data and build a new sparse matrix with that data.
How that works through the public interface is different for different sparse matrix types; you'd want to look up the constructor for whatever type you have. Alternatively, there's the private _with_data method:
feats = feats._with_data(np.log(feats.data), copy=True)
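If you'd rather stay with the public interface, here is a sketch for the CSR case (scikit-learn's vectorizers return CSR) that rebuilds the matrix from the transformed data and the original index structure:

import numpy as np
from scipy.sparse import csr_matrix

# Build a new CSR matrix from the log-transformed data, reusing the
# original indices and indptr arrays
feats = csr_matrix((np.log(feats.data), feats.indices, feats.indptr),
                   shape=feats.shape)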
So I actually needed to take something like log(p+1) for some sparse matrix p and I found this scipy method log1p which returns exactly that on a sparse matrix. I don't have enough reputation to comment so I'm just putting this here in case it helps anyone.
In principle log(p) = log1p(p - 1), but note that subtracting a scalar from a SciPy sparse matrix is not supported (it would turn every zero entry non-zero), so log1p() is mainly useful when you genuinely want log(p + 1):
feats = feats.log1p()
This has the advantage of keeping feats sparse.
fit_transform() returns a SciPy sparse matrix (CSR for scikit-learn's vectorizers), which has a data attribute linked to the data array of the sparse matrix.
You can use the data attribute to manipulate the non-zero entries of the sparse matrix directly, as follows:
feats.data = np.log(feats.data)
I have a huge sparse matrix. I would like to save its dense equivalent to the file system.
The problem is the memory limit on my machine.
My original idea is:
1. convert huge_sparse_matrix to an ndarray with np.asarray(huge_sparse_matrix)
2. assign values
3. save it back to the file system
However, at step 1, Python raises MemoryError.
One possible approach in my mind is:
1. create a chunk of the dense array
2. assign values from the corresponding sparse one
3. save the dense array chunk back to the file system
4. repeat 1-3
But how to do that?
You can use scipy.sparse to read the sparse matrix and then convert it to NumPy; see the documentation and examples here: scipy.sparse docs
I think np.asarray() is not really the function you're looking for.
You might try the SciPy matrix format coo_matrix() (coordinate-format matrix):
scipy.sparse.coo_matrix
This format allows you to store huge sparse matrices in very little memory.
Furthermore, there are many mathematical SciPy functions which also work with this matrix format.
The matrix representation in this format is basically three lists:
row: the index of the row
col: the index of the column
data: the value at this position
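A minimal sketch of building such a matrix (with made-up values):

import numpy as np
from scipy.sparse import coo_matrix

row = np.array([0, 1, 2])         # row index of each non-zero value
col = np.array([1, 0, 2])         # column index of each non-zero value
data = np.array([4.0, 5.0, 6.0])  # the values themselves
m = coo_matrix((data, (row, col)), shape=(3, 3))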
hope that helped, cheers
The common and most straightforward answer to memory problems is: Do not create objects, use an iterator or a generator.
If I understand correctly, you have a sparse matrix and you want to transform it into a list representation. Here's some sample code:
def iter_sparse_matrix(m, d1, d2):
    # Yield (row, column, value) triples for the non-zero entries
    for i in range(d1):
        for j in range(d2):
            if m[i, j]:
                yield (i, j, m[i, j])

dense_array = list(iter_sparse_matrix(m, d1, d2))
You might also want to look here:
http://cvxopt.org/userguide/matrices.html#sparse-matrices
If I'm not wrong, the problem you have is that the dense version of the sparse matrix does not fit in your memory, and thus you are not able to save it.
What I would suggest is to use HDF5. HDF5 handles big data on disk, passing it to memory only when needed.
Something like this should work:
import h5py
data = # your sparse matrix
cx = data.tocoo() # coo sparse representation
This will create your data matrix (full of zeros) on disk:
f = h5py.File('dset.h5','w')
dataset = f.create_dataset("data", data.shape)
Fill the matrix with the sparse data:
dataset[cx.row, cx.col] = cx.data
Add any modifications you want to dataset:
dataset[something, something] = something
And finally, save it:
f.close()
The way HDF5 works is, I think, perfect for your needs. The matrix is always stored on disk, so it doesn't require memory; however, you can operate on it as if it were a standard NumPy matrix (indexing, slicing, NumPy operations and so on), and the h5py driver will send to memory only the parts of the matrix that you need (never the whole matrix unless you specifically request it with something like data[:, :]).
PS: I'm assuming your sparse matrix is one of SciPy's sparse matrices. If not, replace cx.row, cx.col and cx.data with the equivalents from your matrix representation (they should be something similar).
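As a final aside, the chunked approach sketched in the question can also be written directly: densify a block of rows at a time and write each block into the HDF5 dataset. A minimal sketch, assuming a SciPy CSR matrix (the matrix here is randomly generated as a stand-in for the real data):

import h5py
import scipy.sparse as sp

# Hypothetical stand-in for the real matrix
huge_sparse_matrix = sp.random(10000, 5000, density=0.01, format='csr')

chunk_rows = 1000  # how many dense rows to materialize at a time
with h5py.File('dense.h5', 'w') as f:
    dset = f.create_dataset('data', huge_sparse_matrix.shape, dtype='float64')
    for start in range(0, huge_sparse_matrix.shape[0], chunk_rows):
        stop = min(start + chunk_rows, huge_sparse_matrix.shape[0])
        # densify only a small block of rows, then write it to disk
        dset[start:stop, :] = huge_sparse_matrix[start:stop].toarray()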