For my Deep Learning preprocessing, I need to get the term frequencies of each label.
Each label has already been cleaned (stopwords, punctuation, etc.), so I used TfidfVectorizer to compute the term frequencies of each label.
The fit_transform method of TfidfVectorizer returns a scipy.sparse.csr_matrix, which I then need to convert to a numpy array with the toarray() method (to concatenate with my other features).
Code to reproduce:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
def get_preprocessed_labels_count(data):
    """Get the term frequencies of the preprocessed labels.

    Args:
        data (pd.DataFrame): dataframe with the preprocessed data

    Returns:
        np.ndarray: array of term frequencies, one row per label
    """
    labels = data["Labels"].values
    vectorizer = TfidfVectorizer(ngram_range=(1, 2))
    labels_count_np = vectorizer.fit_transform(labels)
    labels_count_np = labels_count_np.toarray()
    return labels_count_np
data = pd.DataFrame({"Labels": ["this is a test", "of course now it is working",
                                "but when there are lot of data, there is memory error"]})
print(get_preprocessed_labels_count(data))
The problem is that toarray() raises a memory error when there is too much data.
Traceback (most recent call last):
File "main.py", line 114, in main
train_and_test(config)
File "/storage/simple/projects/train.py", line 676, in train_and_test
data = get_processed_data(config)
File "/storage/simple/projects/train.py", line 60, in get_processed_data
data["Labels_Count"] = get_preprocessed_labels_count(data)
File "/storage/simple/projects/process_data.py", line 175, in get_preprocessed_labels_count
labels_count_np = labels_count_np.toarray().tolist()
File "/home/juana/.local/lib/python3.8/site-packages/scipy/sparse/_compressed.py", line 1051, in toarray
out = self._process_toarray_args(order, out)
File "/home/juana/.local/lib/python3.8/site-packages/scipy/sparse/_base.py", line 1288, in _process_toarray_args
return np.zeros(self.shape, dtype=self.dtype, order=order)
numpy.core._exceptions._ArrayMemoryError: Unable to allocate N GiB for an array with shape (x, y) and data type float64
For example, with x = 70898 and y = 352302, the allocation fails with N = 186 GiB.
I tried several ways to reduce the memory consumption, but none of them worked (reducing the size of the array with asformat(), running on a compute server with 128 GB of RAM, etc.).
Is there a way to keep the csr_matrix and concatenate it with numpy arrays without converting it to a dense array?
Or is there a better way to convert?
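To illustrate what I am hoping for, here is a sketch (with made-up shapes standing in for my real data) where the tf-idf output stays sparse and scipy.sparse.hstack concatenates it with the dense features:
import numpy as np
from scipy.sparse import csr_matrix, hstack

# made-up shapes standing in for my real data
labels_count = csr_matrix(np.random.rand(5, 10))  # tf-idf output, kept sparse
other_features = np.random.rand(5, 3)             # my other (dense) features

# hstack accepts a mix of sparse and dense blocks and returns a sparse matrix
combined = hstack([labels_count, other_features]).tocsr()
print(combined.shape)  # (5, 13), no dense conversion needed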
I hope someone can help me.
Hello all,
I've hit a stumbling block in some code I'm writing because the fit_transform method keeps failing. It throws this error:
Traceback (most recent call last):
File "/home/user/Datasets/CSVs/Working/Playstore/untitled0.py", line 18, in <module>
data = data[oh_cols].apply(oh.fit_transform)
File "/usr/lib/python3.8/site-packages/pandas/core/frame.py", line 7547, in apply
return op.get_result()
File "/usr/lib/python3.8/site-packages/pandas/core/apply.py", line 180, in get_result
return self.apply_standard()
File "/usr/lib/python3.8/site-packages/pandas/core/apply.py", line 255, in apply_standard
results, res_index = self.apply_series_generator()
File "/usr/lib/python3.8/site-packages/pandas/core/apply.py", line 284, in apply_series_generator
results[i] = self.f(v)
File "/usr/lib/python3.8/site-packages/sklearn/preprocessing/_encoders.py", line 410, in fit_transform
return super().fit_transform(X, y)
File "/usr/lib/python3.8/site-packages/sklearn/base.py", line 690, in fit_transform
return self.fit(X, **fit_params).transform(X)
File "/usr/lib/python3.8/site-packages/sklearn/preprocessing/_encoders.py", line 385, in fit
self._fit(X, handle_unknown=self.handle_unknown)
File "/usr/lib/python3.8/site-packages/sklearn/preprocessing/_encoders.py", line 74, in _fit
X_list, n_samples, n_features = self._check_X(X)
File "/usr/lib/python3.8/site-packages/sklearn/preprocessing/_encoders.py", line 43, in _check_X
X_temp = check_array(X, dtype=None)
File "/usr/lib/python3.8/site-packages/sklearn/utils/validation.py", line 73, in inner_f
return f(**kwargs)
File "/usr/lib/python3.8/site-packages/sklearn/utils/validation.py", line 620, in check_array
raise ValueError(
ValueError: Expected 2D array, got 1D array instead:
array=['Everyone' 'Everyone' 'Everyone' ... 'Everyone' 'Mature 17+' 'Everyone'].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
In short:
ValueError: Expected 2D array, got 1D array instead:
I've done some searching on this online and arrived at a few potential solutions, but they didn't seem to work.
Here's my code:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from category_encoders import CatBoostEncoder, CountEncoder, TargetEncoder
data = pd.read_csv("/home/user/Datasets/CSVs/Working/Playstore/data.csv")
oh = OneHotEncoder()
cb = CatBoostEncoder()
ce = CountEncoder()
te = TargetEncoder()
obj = [i for i in data if data[i].dtypes=="object"]
unique = dict(zip(list(obj),[len(data[i].unique()) for i in obj]))
oh_cols = [i for i in unique if unique[i] < 100]
te_cols = [i for i in unique if unique[i] > 100]
data = data[oh_cols].apply(oh.fit_transform)
It throws the aforementioned error. A solution I saw advised me to use .values when transforming the data and I tried the following:
data = data[oh_cols].values.apply(oh.fit_transform)
data = data[oh_cols].apply(oh.fit_transform).values
encoding = np.array(data[oh_cols])
encoding.apply(oh.fit_transform)
The first and third threw the same error, shown below:
AttributeError: 'numpy.ndarray' object has no attribute 'apply'
While the second threw the first error I mentioned again:
ValueError: Expected 2D array, got 1D array instead:
I'm honestly stumped and I'm not sure where to go from here. The Kaggle exercise I learnt this from went smoothly, but for some reason things never do when I try my hand at things myself.
The fix
data_enc = oh.fit_transform(data[oh_cols])
This is much better than the apply approach anyway, because the fitted oh object now holds lots of useful information: you can inspect the results, later call oh.transform on your test data, etc.
Explaining the errors
Your data is in a pandas DataFrame object. The pandas function apply passes each column to oh.fit_transform as a 1D Series, but OneHotEncoder expects 2D input.
Using .values or np.array() casts your dataframe to a numpy array, but numpy arrays have no apply method.
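As a minimal sketch (with made-up data standing in for data[oh_cols]):
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# made-up stand-in for data[oh_cols]
df = pd.DataFrame({"Rating": ["Everyone", "Mature 17+", "Everyone"],
                   "Type": ["Free", "Paid", "Free"]})

oh = OneHotEncoder()
data_enc = oh.fit_transform(df)  # 2D input: a whole DataFrame, not one Series
print(data_enc.shape)            # sparse matrix, one column per category level
print(oh.categories_)            # the fitted encoder remembers the categories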
I'm trying to combine two types of parameters before clustering.
My parameters are text, represented as a sparse matrix, and another array representing other features of each data point.
I've tried to combine the two types of parameters into one array and pass it as input to the algorithm:
db = DBSCAN(eps=1, min_samples=3, metric=get_distance).fit(array(combined_list))
Also I've built a custom distance metric which I'm going to use.
def get_distance(vec1, vec2):
    text_distance = cosine_similarity(vec1[0], vec2[0])
    other_distance = vec1[1] - vec2[1]
    return (text_distance + other_distance) / 2
But I'm getting an error when trying to pass my input array.
The combined array was constructed as follows:
combined_list = []
for i in range(len(hashes_list)):
    combined_list.append((hashes_list[i], text_list[i]))
combined_list = array(combined_list)
Full Error Traceback:
db = DBSCAN(eps=1, min_samples=3, metric=get_distance ).fit(array(combined_list))
Traceback (most recent call last):
File "/Applications/PyCharm.app/Contents/helpers/pydev/_pydevd_bundle/pydevd_exec2.py", line 3, in Exec
exec(exp, global_vars, local_vars)
File "<input>", line 1, in <module>
File "/Users/tal/src/campaign_detection/Data_Extractor/venv/lib/python3.7/site-packages/sklearn/cluster/dbscan_.py", line 319, in fit
X = check_array(X, accept_sparse='csr')
File "/Users/tal/src/campaign_detection/Data_Extractor/venv/lib/python3.7/site-packages/sklearn/utils/validation.py", line 527, in check_array
array = np.asarray(array, dtype=dtype, order=order)
File "/Users/tal/src/campaign_detection/Data_Extractor/venv/lib/python3.7/site-packages/numpy/core/numeric.py", line 538, in asarray
return array(a, dtype, copy=False, order=order)
ValueError: setting an array element with a sequence.
Is this the correct approach for combining text vector with other parameters?
I have a couple of suggestions for your approach.
DBSCAN has to be fed a 2D array, not an array of tuples, so you have to flatten your input data.
From the documentation:
X : array or sparse (CSR) matrix of shape (n_samples, n_features), or array of shape (n_samples, n_samples)
get_distance() has to return a single value, not an array. Hence, I would suggest using some distance measure for the non-text features as well; I have given an example using Euclidean distance.
Example:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.cluster import DBSCAN
from scipy.sparse import hstack
import numpy as np

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer = TfidfVectorizer()
text_list = vectorizer.fit_transform(corpus)

hashes_list = np.array([[12, 12, 12],
                        [12, 13, 11],
                        [12, 1, 16],
                        [4, 8, 11]])

# stack the text features first so the slices in get_distance line up
combined_list = hstack((text_list, hashes_list))

n1 = len(vectorizer.get_feature_names())  # number of text features

def get_distance(vec1, vec2):
    text_distance = cosine_similarity([vec1[:n1]], [vec2[:n1]])
    other_distance = euclidean_distances([vec1[n1:]], [vec2[n1:]])
    return float((text_distance + other_distance) / 2)  # a single value, not an array

db = DBSCAN(eps=1, min_samples=3, metric=get_distance).fit(combined_list.toarray())
I am trying to produce a cosine similarity matrix using text descriptions of apps. The script below first reads in a csv data file (I can provide the data file if needed) which contains two columns, one with two app categories and the other with tokenized, stemmed descriptions for a number of apps in each of these two categories. The script then creates a tfidf matrix and attempts to produce a cosine similarity matrix.
I updated Anaconda 64 bit for Windows yesterday to make sure I have the latest versions of Python, numpy, scipy, and scikit-learn.
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import os

print('reading file into pandas')
data = pd.read_csv(os.path.join('inputfile.csv'))
cats = np.unique(data['category'])

for i in cats:
    print()
    print('prepping', i)
    d2 = data[data.category == i]
    descStem = d2.descStem.tolist()

    print('vectorizing', i)
    tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 1), min_df=2, stop_words='english')
    tfidf_matrix = tfidf_vectorizer.fit_transform(descStem)
    print(tfidf_matrix.shape)

    print('calculating cosine sim', i)
    cosOrig = cosine_similarity(tfidf_matrix, tfidf_matrix)
The script works just fine for the smaller category of comics, with a tfidf_matrix.shape = (3119, 8217). However, I receive the error message below for the larger category of education, with a tfidf_matrix.shape = (90327, 62863). This matrix has more than 2^32 elements.
Traceback (most recent call last):
File "<ipython-input-1-4b2586ddeca4>", line 1, in <module>
runfile('Z:/rangus/gplay/marcello/data/similarity/error/cosSimByCatScrapeError.py', wdir='Z:/rangus/gplay/marcello/data/similarity/error')
File "F:\u0137777\Continuum\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 866, in runfile
execfile(filename, namespace)
File "F:\u0137777\Continuum\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "Z:/rangus/gplay/marcello/data/similarity/error/cosSimByCatScrapeError.py", line 23, in <module>
cosOrig = cosine_similarity(tfidf_matrix, tfidf_matrix)
File "F:\u0137777\Continuum\Anaconda3\lib\site-packages\sklearn\metrics\pairwise.py", line 918, in cosine_similarity
K = safe_sparse_dot(X_normalized, Y_normalized.T, dense_output=dense_output)
File "F:\u0137777\Continuum\Anaconda3\lib\site-packages\sklearn\utils\extmath.py", line 186, in safe_sparse_dot
ret = ret.toarray()
File "F:\u0137777\Continuum\Anaconda3\lib\site-packages\scipy\sparse\compressed.py", line 920, in toarray
return self.tocoo(copy=False).toarray(order=order, out=out)
File "F:\u0137777\Continuum\Anaconda3\lib\site-packages\scipy\sparse\coo.py", line 258, in toarray
B.ravel('A'), fortran)
ValueError: could not convert integer scalar
I can overcome this error by running the code below, but using a dense matrix is a massive memory hog and I need to run this script on 40+ categories.
print('vectorizing', i)
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 1), min_df=2, stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(descStem)
tfidf_matrixD = tfidf_matrix.toarray()
print('calculating cosine sim', i)
cosOrig = cosine_similarity(tfidf_matrixD, tfidf_matrixD)
This is the closest similar issue I could find on StackOverflow, but I couldn't see how it would help my situation...
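One direction I am considering (a rough sketch, not yet validated at full scale) is to compute the similarity matrix in row blocks, so that only one dense slab is held in memory at a time:
from sklearn.metrics.pairwise import cosine_similarity

def cosine_sim_in_blocks(X, block_size=5000):
    # X is the sparse tf-idf matrix; each iteration densifies only a
    # (block_size x n_samples) slice instead of the full matrix
    n = X.shape[0]
    for start in range(0, n, block_size):
        stop = min(start + block_size, n)
        yield start, stop, cosine_similarity(X[start:stop], X)

# e.g. process or save each block instead of holding the full matrix:
# for start, stop, block in cosine_sim_in_blocks(tfidf_matrix):
#     np.save('cos_%d_%d.npy' % (start, stop), block)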
I'm trying to process a numpy array with 71,000 rows and 200 columns of floats, and the two scikit-learn models I'm trying both give different errors when I exceed 5,853 rows. I tried removing the problematic row, but it continues to fail. Can scikit-learn not handle this much data, or is it something else? X is a numpy array built from a list of lists.
KNN:
nbrs = NearestNeighbors(n_neighbors=2, algorithm='ball_tree').fit(X)
Error:
File "knn.py", line 48, in <module>
nbrs = NearestNeighbors(n_neighbors=2, algorithm='ball_tree').fit(X)
File "/usr/local/lib/python2.7/dist-packages/sklearn/neighbors/base.py", line 642, in fit
return self._fit(X)
File "/usr/local/lib/python2.7/dist-packages/sklearn/neighbors/base.py", line 180, in _fit
raise ValueError("data type not understood")
ValueError: data type not understood
K-Means:
kmeans_model = KMeans(n_clusters=2, random_state=1).fit(X)
Error:
Traceback (most recent call last):
File "knn.py", line 48, in <module>
kmeans_model = KMeans(n_clusters=2, random_state=1).fit(X)
File "/usr/local/lib/python2.7/dist-packages/sklearn/cluster/k_means_.py", line 702, in fit
X = self._check_fit_data(X)
File "/usr/local/lib/python2.7/dist-packages/sklearn/cluster/k_means_.py", line 668, in _check_fit_data
X = atleast2d_or_csr(X, dtype=np.float64)
File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 134, in atleast2d_or_csr
"tocsr", force_all_finite)
File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 111, in _atleast2d_or_sparse
force_all_finite=force_all_finite)
File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 91, in array2d
X_2d = np.asarray(np.atleast_2d(X), dtype=dtype, order=order)
File "/usr/local/lib/python2.7/dist-packages/numpy/core/numeric.py", line 235, in asarray
return array(a, dtype, copy=False, order=order)
ValueError: setting an array element with a sequence.
Please check the dtype of your matrix X, e.g. by typing X.dtype. If it is object or dtype('O'), then write the lengths of the rows of X into an array:
lengths = [len(line) for line in X]
Then take a look to see whether all rows have the same length, by invoking
np.unique(lengths)
If there is more than one number in the output, then your row lengths differ, e.g. from row 5853 on, but possibly not all the time.
Numpy arrays are only useful if all rows have the same length (construction still works if not, but the result doesn't do what you expect). You should check what is causing this, correct it, and then return to knn.
Here is an example of what happens if row lengths are not the same:
import numpy as np
rng = np.random.RandomState(42)
X = rng.randn(100, 20)
# now remove one element from the 56th line
X = list(X)
X[55] = X[55][:-1]
# turn it back into an ndarray
X = np.array(X)
# check the dtype
print(X.dtype)  # returns dtype('O')
from sklearn.neighbors import NearestNeighbors
nbrs = NearestNeighbors()
nbrs.fit(X) # raises your first error
from sklearn.cluster import KMeans
kmeans = KMeans()
kmeans.fit(X) # raises your second error
I process rather large matrices in Python/Scipy. I need to extract rows from a large matrix (which is loaded as a coo_matrix) and use them as diagonal elements. Currently I do that in the following fashion:
import numpy as np
from scipy import sparse
import profile

def computation(A):
    for i in range(A.shape[0]):
        diag_elems = np.array(A[i, :].todense())
        ith_diag = sparse.spdiags(diag_elems, 0, A.shape[1], A.shape[1], format="csc")
        # ...

# create some random matrix
A = (sparse.rand(1000, 100000, 0.02, format="csc") * 5).astype(np.ubyte)

# get timings
profile.run('computation(A)')
What I see from the profile output is that most of the time is consumed by the get_csr_submatrix function while extracting diag_elems. That makes me think that I am using either an inefficient sparse representation of the initial data or the wrong way of extracting a row from a sparse matrix. Can you suggest a better way to extract a row from a sparse matrix and represent it in diagonal form?
EDIT
The following variant removes the bottleneck from the row extraction (notice that simply changing 'csc' to 'csr' is not sufficient; A[i,:] must be replaced with A.getrow(i) as well). However, the main question remains: how to omit the materialization (.todense()) and create the diagonal matrix directly from the sparse representation of the row.
import numpy as np
from scipy import sparse
import profile

def computation(A):
    for i in range(A.shape[0]):
        diag_elems = np.array(A.getrow(i).todense())
        ith_diag = sparse.spdiags(diag_elems, 0, A.shape[1], A.shape[1], format="csc")
        # ...

# create some random matrix
A = (sparse.rand(1000, 100000, 0.02, format="csr") * 5).astype(np.ubyte)

# get timings
profile.run('computation(A)')
If I create the DIAgonal matrix from the 1-row CSR matrix directly, as follows:
diag_elems = A.getrow(i)
ith_diag = sparse.spdiags(diag_elems,0,A.shape[1],A.shape[1])
then I can neither specify the format="csc" argument, nor convert ith_diag to CSC format:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.6/profile.py", line 70, in run
prof = prof.run(statement)
File "/usr/local/lib/python2.6/profile.py", line 456, in run
return self.runctx(cmd, dict, dict)
File "/usr/local/lib/python2.6/profile.py", line 462, in runctx
exec cmd in globals, locals
File "<string>", line 1, in <module>
File "<stdin>", line 4, in computation
File "/usr/local/lib/python2.6/site-packages/scipy/sparse/construct.py", line 56, in spdiags
return dia_matrix((data, diags), shape=(m,n)).asformat(format)
File "/usr/local/lib/python2.6/site-packages/scipy/sparse/base.py", line 211, in asformat
return getattr(self,'to' + format)()
File "/usr/local/lib/python2.6/site-packages/scipy/sparse/dia.py", line 173, in tocsc
return self.tocoo().tocsc()
File "/usr/local/lib/python2.6/site-packages/scipy/sparse/coo.py", line 263, in tocsc
data = np.empty(self.nnz, dtype=upcast(self.dtype))
File "/usr/local/lib/python2.6/site-packages/scipy/sparse/sputils.py", line 47, in upcast
raise TypeError,'no supported conversion for types: %s' % args
TypeError: no supported conversion for types: object
Here's what I came up with:
def computation(A):
    for i in range(A.shape[0]):
        idx_begin = A.indptr[i]
        idx_end = A.indptr[i + 1]
        row_nnz = idx_end - idx_begin
        diag_elems = A.data[idx_begin:idx_end]
        diag_indices = A.indices[idx_begin:idx_end]
        ith_diag = sparse.csc_matrix((diag_elems, (diag_indices, diag_indices)),
                                     shape=(A.shape[1], A.shape[1]))
        ith_diag.eliminate_zeros()
Python profiler said 1.464 seconds versus 5.574 seconds before. It takes advantage of the underlying dense arrays (indptr, indices, data) that define sparse matrices. Here's my crash course: A.indptr[i]:A.indptr[i+1] gives the positions in those arrays that correspond to the non-zero values in row i; A.data is a dense 1D array of the non-zero values of A, and A.indices holds the column where each of those values goes.
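To make that concrete, here is a tiny illustration of those three arrays on a hand-built matrix (the values are arbitrary):
from scipy import sparse

M = sparse.csr_matrix([[0, 3, 0],
                       [4, 0, 5]])
print(M.data)     # [3 4 5]   the non-zero values
print(M.indices)  # [1 0 2]   the column of each value
print(M.indptr)   # [0 1 3]   row i occupies data[indptr[i]:indptr[i+1]]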
I would do some more testing to make very certain this does the same thing as before. I only checked a few cases.
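For example, a minimal sanity check along these lines (a sketch on a small random float matrix, not the full-size data) compares the two constructions row by row:
import numpy as np
from scipy import sparse

A = sparse.rand(50, 200, 0.05, format="csr")
for i in range(A.shape[0]):
    idx_begin, idx_end = A.indptr[i], A.indptr[i + 1]
    elems = A.data[idx_begin:idx_end]
    cols = A.indices[idx_begin:idx_end]
    fast = sparse.csc_matrix((elems, (cols, cols)), shape=(A.shape[1], A.shape[1]))
    slow = sparse.spdiags(np.array(A.getrow(i).todense()), 0,
                          A.shape[1], A.shape[1], format="csc")
    assert (fast != slow).nnz == 0  # both constructions agree on every element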