Using toarray() method shows memory error - python

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

xtrain, xtest, ytrain, ytest = train_test_split(df_train['clean_comments'], df_train['label'].values, test_size=0.3, shuffle=True)
vectorizer = TfidfVectorizer(strip_accents='unicode', analyzer='word', ngram_range=(1, 3), norm='l2')
vectorizer.fit(xtrain)
x_train = vectorizer.transform(xtrain)
x_train = x_train.toarray()  # raises MemoryError
I am trying to convert a sparse matrix to a dense array using the toarray() method, but it raises a MemoryError.
I've already tried the todense() method, but that didn't work either.

Sparse matrices store only the non-zero values in memory, which makes them very well suited to bag-of-words matrices. If you convert a sparse matrix to a dense format, it consumes far more memory because every zero is then stored explicitly as well. If you don't have enough memory, that conversion raises a MemoryError.
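For a rough sense of the gap, here is a sketch (assuming the x_train CSR matrix from the question) that estimates both footprints:

import numpy as np

# a dense float64 copy needs 8 bytes for every cell, zeros included
n_rows, n_cols = x_train.shape
dense_gb = n_rows * n_cols * 8 / 1e9

# the sparse matrix only stores its non-zero values plus two index arrays
sparse_gb = (x_train.data.nbytes + x_train.indices.nbytes + x_train.indptr.nbytes) / 1e9

print(f"dense: ~{dense_gb:.1f} GB, sparse: ~{sparse_gb:.2f} GB")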

UPDATED question:
I have a 120000x14000 sparse matrix on which I want to do some matrix algebra:
import numpy as np
from scipy import sparse

c = np.sum(indM, axis=1).T
w = np.diag(1 / np.array(c)[0])  # Fails with memory error
w = sparse.eye(len(indM), dtype=np.float) / np.array(c)[0]  # Fails with memory error
w = np.nan_to_num(w)
u = w @ indM  # Fails with 'Object types not supported'
u_avg = np.array(np.sum(u, axis=0) / np.sum(indM, axis=0))[0]
So the problem is that the above first fails with a memory error when creating a diagonal matrix with non-integers on the diagonal. If I manage to proceed, the kernel somehow doesn't recognize "Objects" as supported types, meaning I can't do sparse matrices, I think?
What do you recommend I do?
Try using numpy's sum. In my experience, it tends to blow other stuff out of the water when it comes to performance.
import numpy as np
c = np.sum(indM,axis=1)
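As a sketch that goes beyond the answer above (assuming indM is a scipy.sparse matrix), the diagonal weight matrix itself can also be built in sparse form with scipy.sparse.diags, so the 120000x120000 diagonal is never densified:

import numpy as np
from scipy import sparse

c = np.asarray(indM.sum(axis=1)).ravel()   # row sums as a flat 1-D array
inv_c = np.divide(1.0, c, out=np.zeros_like(c, dtype=float), where=c != 0)
w = sparse.diags(inv_c)                    # sparse diagonal instead of np.diag
u = w @ indM                               # the result stays sparse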
It sounds like you don't have enough RAM to handle such a large array. The obvious choice here is to use methods from scipy.sparse but you say you've tried that and still encounter a memory problem. Fortunately, there are still a few other options:
Change your dataframe to a numpy array (this may reduce memory overhead)
You could use numpy.memmap to map your array to a location stored in binary on disk (see the sketch after this list).
At the expense of precision, you could change the dtype of any floats from float64 (the default) to float32.
If you are loading your data from a .csv file, pd.read_csv has an option chunksize which allows you to read in your data in chunks.
Try using a cloud-based resource like Kaggle. There may be more processing power available there than on your machine.
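Here is a sketch combining the float32 and memmap suggestions above, reusing the names from the question (the file name and the 1000-row block size are my own choices):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# ask the vectorizer for float32 output instead of the default float64
vectorizer = TfidfVectorizer(strip_accents='unicode', analyzer='word',
                             ngram_range=(1, 3), norm='l2', dtype=np.float32)
x_train = vectorizer.fit_transform(xtrain)

# if a dense copy really is needed, back it with a file on disk and fill it
# in row blocks so only one block is ever dense in RAM
mm = np.memmap('xtrain_dense.dat', dtype=np.float32, mode='w+', shape=x_train.shape)
for start in range(0, x_train.shape[0], 1000):
    mm[start:start + 1000] = x_train[start:start + 1000].toarray()
mm.flush()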

How to apply log to a numpy sparse matrix elementwise

I'm working with Python, sklearn and numpy, and I am creating the following sparse matrix:
feats = tfidf_vect.fit_transform(np.asarray(tweets))
print(feats)
feats=np.log(np.asarray(feats))
but I am getting the following error when I apply the log:
Traceback (most recent call last):
File "src/ef_tfidf.py", line 100, in <module>
feats=np.log(np.asarray(feats))
AttributeError: log
The error is related to the fact that feats is a sparse matrix. I would appreciate any help with this, i.e. a way to apply the log to a sparse matrix.
The correct way to convert a sparse matrix to an ndarray is with the toarray method:
feats = np.log(feats.toarray())
np.array doesn't understand sparse matrix inputs.
If you want to only take the log of non-zero entries and return a sparse matrix of results, the best way would probably be to take the logarithm of the matrix's data and build a new sparse matrix with that data.
How that works through the public interface is different for different sparse matrix types; you'd want to look up the constructor for whatever type you have. Alternatively, there's the private _with_data method:
feats = feats._with_data(np.log(feats.data), copy=True)
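For example, assuming feats is in CSR format (which is what TfidfVectorizer returns), the public-constructor route could look like this:

import numpy as np
from scipy import sparse

# rebuild the matrix with log-transformed values but the same sparsity structure
feats = sparse.csr_matrix((np.log(feats.data), feats.indices, feats.indptr),
                          shape=feats.shape)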
So I actually needed to take something like log(p+1) for some sparse matrix p and I found this scipy method log1p which returns exactly that on a sparse matrix. I don't have enough reputation to comment so I'm just putting this here in case it helps anyone.
You could apply this to the original question with
feats = (feats-1).log1p()
This has the advantage of keeping feats sparse.
fit_transform() returns a scipy.sparse.coo_matrix object, which has a data attribute linked to the data array of the sparse matrix.
You can use the data attribute to manipulate the non-zero data of the coo sparse matrix directly, as follows:
feats.data = np.log(feats.data)

Load a huge sparse array and save it back as a dense array

I have a huge sparse matrix. I would like to save the dense equivalent to the file system.
The problem is the memory limit on my machine.
My original idea is:
1. convert huge_sparse_matrix to an ndarray with np.asarray(huge_sparse_matrix)
2. assign values
3. save it back to the file system
However, at step 1, Python raises MemoryError.
One possible approach in my mind is:
1. create a chunk of the dense array
2. assign values from the corresponding sparse one
3. save the dense array chunk back to the file system
4. repeat 1-3
But how to do that?
You can use the scipy.sparse functions to read the sparse matrix and then convert it to numpy; see the documentation here: scipy.sparse docs and examples.
I think np.asarray() is not really the function you're looking for.
You might try the SciPy matrix format coo_matrix() (coordinate formatted matrix).
scipy.sparse.coo_matrix
This format allows you to store huge sparse matrices in very little memory.
Furthermore, there are many mathematical scipy functions that also work with this matrix format.
The matrix representation in this format is basically three lists:
row: the index of the row
col: the index of the column
data: the value at this position
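For illustration, a tiny hand-made example (the numbers are made up):

from scipy.sparse import coo_matrix

row = [0, 1, 3]          # row indices of the non-zero entries
col = [2, 0, 3]          # column indices of the non-zero entries
data = [4.0, 7.0, 9.0]   # the values stored at those positions
m = coo_matrix((data, (row, col)), shape=(4, 4))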
hope that helped, cheers
The common and most straightforward answer to memory problems is: Do not create objects, use an iterator or a generator.
If I understand correctly, you have a sparse matrix and you want to transform it into a list representation. Here's a sample code:
def iter_sparse_matrix(m, d1, d2):
    for i in range(d1):        # xrange in the original Python 2 code
        for j in range(d2):
            if m[i][j]:
                yield (i, j, m[i][j])

dense_array = list(iter_sparse_matrix(m, d1, d2))
You might also want to look here:
http://cvxopt.org/userguide/matrices.html#sparse-matrices
If I'm not wrong, the problem you have is that the dense version of the sparse matrix does not fit in your memory, and thus you are not able to save it.
What I would suggest is using HDF5. HDF5 handles big data on disk, passing it to memory only when needed.
I think something like this should work:
import h5py
data = # your sparse matrix
cx = data.tocoo() # coo sparse representation
This will create your data matrix (of zeros) on disk:
f = h5py.File('dset.h5','w')
dataset = f.create_dataset("data", data.shape)
Fill the matrix with the sparse data:
dataset[cx.row, cx.col] = cx.data
Add any modifications you want to dataset:
dataset[something, something] = something
And finally, save it:
f.close()
The way HDF5 works is, I think, perfect for your needs. The matrix is always stored on disk, so it doesn't require memory; however, you can operate on it as if it were a standard numpy matrix (indexing, slicing, np.(...) operations and so on), and the h5py driver will send the parts of the matrix that you need to memory (never the whole matrix, unless you specifically require it with something like data[:, :]).
PS: I'm assuming your sparse matrix is one of scipy's sparse matrices. If not, replace cx.row, cx.col and cx.data with the ones provided by your matrix representation (they should be something similar).

Python: memory error while changing data type from integer to float

I have an array of size 13000*300000 filled with integers from 0 to 255. I would like to change their data type from integer to float as follows, where data is a numpy array:
data.astype('float')
While changing its data type from integer to float, it shows a memory error. I have 80 GB of RAM, and it still shows a memory error. Could you please let me know what the reason might be?
The problem here is that data is huge (about 30 GB of sequential data; see How much memory in numpy array?), hence the error while trying to fit it into memory. Instead of doing the operation on the whole array, slice it, do the operation on each slice, and then merge, like:
n = 300000
d1 = data[:, :n // 2].astype('float')
d2 = data[:, n // 2:].astype('float')
data = np.hstack((d1, d2))
Generally, since your data size is so unwieldy, consider consuming it in parts to avoid being bitten by these sorts of problems all the time (see Techniques for working with large Numpy arrays? for this and other techniques).
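A sketch of the same idea taken a step further (the float32 dtype and the 10000-column block size are assumptions): preallocate the result once and convert one block of columns at a time, so two full float64 copies are never held at once.

import numpy as np

out = np.empty(data.shape, dtype=np.float32)   # ~15 GB instead of ~30 GB
step = 10000
for start in range(0, data.shape[1], step):
    out[:, start:start + step] = data[:, start:start + step]  # cast block by block
data = out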

Using pytables, which is more efficient: scipy.sparse or numpy dense matrix?

When using pytables, there's no support (as far as I can tell) for the scipy.sparse matrix formats, so to store a matrix I have to do some conversion, e.g.
def store_sparse_matrix(self):
    grp1 = self.getFileHandle().createGroup(self.getGroup(), 'M')
    self.getFileHandle().createArray(grp1, 'data', M.tocsr().data)
    self.getFileHandle().createArray(grp1, 'indptr', M.tocsr().indptr)
    self.getFileHandle().createArray(grp1, 'indices', M.tocsr().indices)

def get_sparse_matrix(self):
    return sparse.csr_matrix((self.getGroup().M.data, self.getGroup().M.indices, self.getGroup().M.indptr))
The trouble is that the get_sparse function takes some time (reading from disk), and if I understand it correctly also requires the data to fit into memory.
The only other option seems to be to convert the matrix to dense format (a numpy array) and then use pytables normally. However, this seems rather inefficient, although I suppose pytables will perhaps deal with the compression itself?
Borrowing from Storing numpy sparse matrix in HDF5 (PyTables), you can marshal a scipy.sparse array into a pytables format using its data, indices, and indptr attributes, which are three regular numpy.ndarray objects.
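A sketch of what that can look like with the modern PyTables API (the function and node names are my own, and the shape is stored alongside so the matrix can be rebuilt exactly):

import numpy as np
import tables
from scipy import sparse

def store_csr(filename, m):
    m = m.tocsr()
    with tables.open_file(filename, 'w') as f:
        grp = f.create_group(f.root, 'M')
        for name in ('data', 'indices', 'indptr'):
            f.create_array(grp, name, getattr(m, name))
        f.create_array(grp, 'shape', np.array(m.shape))

def load_csr(filename):
    with tables.open_file(filename, 'r') as f:
        grp = f.root.M
        return sparse.csr_matrix((grp.data[:], grp.indices[:], grp.indptr[:]),
                                 shape=tuple(grp.shape[:]))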
