During an NLP process, I transform a corpus of texts using TF-IDF, which yields a scipy.sparse.csr.csr_matrix.
I then split this data into train and test sets and resample the train set to tackle a class imbalance problem.
The issue I'm facing is that when I use the resampled index (from the label which is of type pandas.Series) to resample the sparse matrix like this:
tfs[Ytr_resample.index]
It takes a lot of time, and outputs the error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-29-dd1413907d77> in <module>()
----> 1 tfs[Ytr_cat_resample.index]
/usr/local/lib/python3.5/dist-packages/scipy/sparse/csr.py in __getitem__(self, key)
348 csr_sample_values(self.shape[0], self.shape[1],
349 self.indptr, self.indices, self.data,
--> 350 num_samples, row.ravel(), col.ravel(), val)
351 if row.ndim == 1:
352 # row and col are 1d
ValueError: could not convert integer scalar
Following this thread I checked that the biggest element in the index wouldn't be bigger than the number of rows in my sparse matrix.
The problem seems to come from the fact that the index is coded in np.int64 and not in np.int32. Indeed the following works:
Xtr_resample = tfs[[np.int32(ind) for ind in Ytr_resample.index]]
Therefore I have two questions:
Is the error actually coming from this conversion int32 to int64?
Is there a more pythonic way to convert the index type? (Ytr_resample.index.astype(np.int32) does not seem to change the type for some reason)
EDIT:
Ytr_resample.index is of type pandas.core.indexes.numeric.Int64Index:
Int64Index([1484, 753, 1587, 1494, 357, 1484, 84, 823, 424, 424,
...
1558, 1619, 1317, 1635, 537, 1206, 1152, 1635, 1206, 131],
dtype='int64', length=4840)
I created Ytr_resample by resampling Ytr (which is pandas.Series) such that every category present in Ytr has the same number of elements (by oversampling):
n_samples = Ytr.value_counts(dropna = False).max()
Ytr_resample = pd.DataFrame(Ytr).groupby('cat').apply(\
lambda x: x.sample(n_samples,replace = True,random_state=42)).cat
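For reference, here is a minimal sketch of the whole workflow on toy data. The small tfs and Ytr below are made-up stand-ins, and the sketch assumes that the index labels of Ytr are the same as the row positions in tfs. It converts the whole index to an int32 NumPy array in one step instead of casting element by element:

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Hypothetical stand-ins for the real tfs and Ytr
tfs = sparse.random(10, 5, density=0.3, format="csr", random_state=0)
Ytr = pd.Series(["a"] * 7 + ["b"] * 3, name="cat")

# Oversample every category up to the size of the largest one
n_samples = Ytr.value_counts(dropna=False).max()
parts = [
    grp.sample(n_samples, replace=True, random_state=42)
    for _, grp in Ytr.groupby(Ytr)
]
Ytr_resample = pd.concat(parts)

# Convert the whole index to a NumPy int32 array in one step
rows = Ytr_resample.index.to_numpy().astype(np.int32)

# Select the resampled rows from the sparse matrix (result stays sparse)
Xtr_resample = tfs[rows]
print(Xtr_resample.shape)
```

If the index labels are not row positions (for example after a shuffled train/test split), the labels would first have to be mapped back to positions.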
So, as the title says, I'm trying to calculate the probability of a value given a list of samples, preferably normalized so the probability is 0<p<1. I found this answer on the topic from about 6 years ago, which seemed promising. To test it, I implemented the example used in the first reply (edited for brevity):
import numpy as np
from sklearn.neighbors import KernelDensity
from scipy.integrate import quad
# Generate random samples from a mixture of 2 Gaussians
# with modes at 5 and 10
data = np.concatenate((5 + np.random.randn(10, 1),
10 + np.random.randn(30, 1)))
x = np.linspace(0, 16, 1000)[:, np.newaxis]
# Do kernel density estimation
kd = KernelDensity(kernel='gaussian', bandwidth=0.75).fit(data)
# Get probability for range of values
start = 5 # Start of the range
end = 6 # End of the range
probability = quad(lambda x: np.exp(kd.score_samples(x)), start, end)[0]
However, this approach throws the following error:
Traceback (most recent call last):
File "prob test.py", line 44, in <module>
probability = quad(lambda x: np.exp(kd.score_samples(x)), start, end)[0]
File "/usr/lib/python3/dist-packages/scipy/integrate/quadpack.py", line 340, in quad
retval = _quad(func, a, b, args, full_output, epsabs, epsrel, limit,
File "/usr/lib/python3/dist-packages/scipy/integrate/quadpack.py", line 448, in _quad
return _quadpack._qagse(func,a,b,args,full_output,epsabs,epsrel,limit)
File "prob test.py", line 44, in <lambda>
probability = quad(lambda x: np.exp(kd.score_samples(x)), start, end)[0]
File "/usr/lib/python3/dist-packages/sklearn/neighbors/_kde.py", line 190, in score_samples
X = check_array(X, order='C', dtype=DTYPE)
File "/usr/lib/python3/dist-packages/sklearn/utils/validation.py", line 545, in check_array
raise ValueError(
ValueError: Expected 2D array, got scalar array instead:
array=5.5.
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
I'm not sure how to reshape the input when it's already inside the lambda function, and, in any case, I'm guessing this is happening because Scikit-Learn has been updated in the 6 years since this answer was written. What's the best way to work around this issue to get the probability value?
Thanks!
As the library documentation says:
score_samples(X): X array-like of shape (n_samples, n_features)
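Therefore, you should pass a 2D array-like and not a scalar. The inner [0] turns the one-element result of score_samples into a scalar, and the trailing [0] keeps only the integral from the (value, error) tuple that quad returns:
probability = quad(lambda x: np.exp(kd.score_samples(np.array([[x]])))[0], start, end)[0]
or:
probability = quad(lambda x: np.exp(kd.score_samples(np.array([x]).reshape(-1, 1)))[0], start, end)[0]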
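To put it together, here is a minimal end-to-end sketch using the same kind of data as in the question. The helper name density and the fixed random seed are my own additions for illustration. Integrating the density over a wide range is a quick sanity check that the estimate is normalized:

```python
import numpy as np
from scipy.integrate import quad
from sklearn.neighbors import KernelDensity

# Same kind of data as in the question, with a fixed seed for reproducibility
rng = np.random.RandomState(0)
data = np.concatenate((5 + rng.randn(10, 1), 10 + rng.randn(30, 1)))

kd = KernelDensity(kernel="gaussian", bandwidth=0.75).fit(data)


def density(x):
    """Evaluate the estimated density at a single scalar x."""
    # score_samples wants shape (n_samples, n_features) and returns log-density
    log_dens = kd.score_samples(np.array([[x]]))
    return float(np.exp(log_dens)[0])


# Probability that a value falls between 5 and 6
probability, abs_error = quad(density, 5, 6)

# Sanity check: the density should integrate to roughly 1 over a wide range
total, _ = quad(density, -10, 30)

print(probability, total)
```

Since the result is an integral of a normalized density over a sub-range, it lies between 0 and 1.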
For my deep learning preprocessing, I need the term frequencies of each label.
Each label has already been cleaned (stopwords, punctuation, etc.).
So I used TfidfVectorizer to get the term frequencies of each label.
The fit_transform function of TfidfVectorizer returns a scipy.sparse.csr_matrix.
Next, I need to convert it to a NumPy array with the toarray() method (to concatenate it with my other features).
Code to reproduce:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
def get_preprocessed_labels_count(data):
    """Get the TF-IDF matrix of the preprocessed labels.
    Args:
        data (pd.DataFrame): dataframe with the preprocessed data
    Returns:
        np.ndarray: dense TF-IDF matrix of the preprocessed labels
    """
    labels = data["Labels"].values
    vectorizer = TfidfVectorizer(ngram_range=(1, 2))
    labels_count_np = vectorizer.fit_transform(labels)
    # toarray() allocates the full dense matrix; this is the call that runs out of memory
    labels_count_np = labels_count_np.toarray()
    return labels_count_np
data = pd.DataFrame({"Labels": ["this is a test", "of course now it is working",
                                "but when there are lot of data, there is memory error"]})
print(get_preprocessed_labels_count(data))
The problem is that toarray() throws an error when converting too much data.
Traceback (most recent call last):
File "main.py", line 114, in main
train_and_test(config)
File "/storage/simple/projects/train.py", line 676, in train_and_test
data = get_processed_data(config)
File "/storage/simple/projects/train.py", line 60, in get_processed_data
data["Labels_Count"] = get_preprocessed_labels_count(data)
File "/storage/simple/projects/process_data.py", line 175, in get_preprocessed_labels_count
labels_count_np = labels_count_np.toarray().tolist()
File "/home/juana/.local/lib/python3.8/site-packages/scipy/sparse/_compressed.py", line 1051, in toarray
out = self._process_toarray_args(order, out)
File "/home/juana/.local/lib/python3.8/site-packages/scipy/sparse/_base.py", line 1288, in _process_toarray_args
return np.zeros(self.shape, dtype=self.dtype, order=order)
numpy.core._exceptions._ArrayMemoryError: Unable to allocate N GiB for an array with shape (x, y) and data type float64
For example, with x = 70898 and y = 352302, NumPy tries to allocate N = 186 GiB and fails.
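That number is simply rows × columns × 8 bytes per float64 value. A quick back-of-the-envelope check with the shape above:

```python
rows, cols = 70898, 352302
bytes_per_value = 8  # one float64 takes 8 bytes

# Size of the dense array that toarray() would have to allocate
dense_gib = rows * cols * bytes_per_value / 2**30
print(f"{dense_gib:.1f} GiB")
```

So the dense array alone is already larger than the 128 GB of RAM on the server.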
I tried several ways to reduce memory consumption, but none worked (changing the format with asformat(), running on a compute server with 128 GB of RAM, etc.).
Is there a way to keep the csr_matrix and concatenate it with my NumPy features without converting it to a dense array?
Is there a better way to convert it?
I hope someone can help me.
Data source can be found here.
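One direction that might avoid the dense conversion is scipy.sparse.hstack, which concatenates sparse blocks column-wise and keeps the result sparse. Below is a minimal sketch under that assumption. The other_features array is a made-up placeholder for the remaining features, and whatever consumes the result would need to accept sparse input:

```python
import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

labels = ["this is a test", "of course now it is working",
          "but when there are lot of data, there is memory error"]

# Hypothetical dense features, one row per label
other_features = np.arange(6, dtype=np.float64).reshape(3, 2)

# TF-IDF result stays a sparse CSR matrix
tfidf = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(labels)

# Wrap the dense block as sparse and stack column-wise; nothing is densified
combined = sparse.hstack([tfidf, sparse.csr_matrix(other_features)], format="csr")

print(type(combined), combined.shape)
```

Only the small dense block is converted here; the large TF-IDF part is never expanded.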
Hello all,
I've hit a stumbling block in some code I'm writing because the fit_transform method continuously fails. It throws this error:
Traceback (most recent call last):
File "/home/user/Datasets/CSVs/Working/Playstore/untitled0.py", line 18, in <module>
data = data[oh_cols].apply(oh.fit_transform)
File "/usr/lib/python3.8/site-packages/pandas/core/frame.py", line 7547, in apply
return op.get_result()
File "/usr/lib/python3.8/site-packages/pandas/core/apply.py", line 180, in get_result
return self.apply_standard()
File "/usr/lib/python3.8/site-packages/pandas/core/apply.py", line 255, in apply_standard
results, res_index = self.apply_series_generator()
File "/usr/lib/python3.8/site-packages/pandas/core/apply.py", line 284, in apply_series_generator
results[i] = self.f(v)
File "/usr/lib/python3.8/site-packages/sklearn/preprocessing/_encoders.py", line 410, in fit_transform
return super().fit_transform(X, y)
File "/usr/lib/python3.8/site-packages/sklearn/base.py", line 690, in fit_transform
return self.fit(X, **fit_params).transform(X)
File "/usr/lib/python3.8/site-packages/sklearn/preprocessing/_encoders.py", line 385, in fit
self._fit(X, handle_unknown=self.handle_unknown)
File "/usr/lib/python3.8/site-packages/sklearn/preprocessing/_encoders.py", line 74, in _fit
X_list, n_samples, n_features = self._check_X(X)
File "/usr/lib/python3.8/site-packages/sklearn/preprocessing/_encoders.py", line 43, in _check_X
X_temp = check_array(X, dtype=None)
File "/usr/lib/python3.8/site-packages/sklearn/utils/validation.py", line 73, in inner_f
return f(**kwargs)
File "/usr/lib/python3.8/site-packages/sklearn/utils/validation.py", line 620, in check_array
raise ValueError(
ValueError: Expected 2D array, got 1D array instead:
array=['Everyone' 'Everyone' 'Everyone' ... 'Everyone' 'Mature 17+' 'Everyone'].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
In short:
ValueError: Expected 2D array, got 1D array instead:
I've done some searching on this online and arrived at a few potential solutions, but they didn't seem to work.
Here's my code:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from category_encoders import CatBoostEncoder,CountEncoder,TargetEncoder
data = pd.read_csv("/home/user/Datasets/CSVs/Working/Playstore/data.csv")
oh = OneHotEncoder()
cb = CatBoostEncoder()
ce = CountEncoder()
te = TargetEncoder()
obj = [i for i in data if data[i].dtypes=="object"]
unique = dict(zip(list(obj),[len(data[i].unique()) for i in obj]))
oh_cols = [i for i in unique if unique[i] < 100]
te_cols = [i for i in unique if unique[i] > 100]
data = data[oh_cols].apply(oh.fit_transform)
It throws the aforementioned error. A solution I saw advised me to use .values when transforming the data and I tried the following:
data = data[oh_cols].values.apply(oh.fit_transform)
data = data[oh_cols].apply(oh.fit_transform).values
encoding = np.array(data[oh_cols])
encoding.apply(oh.fit_transform)
The first and the third threw the same error, shown below:
AttributeError: 'numpy.ndarray' object has no attribute 'apply'
While the second threw the first error I mentioned again:
ValueError: Expected 2D array, got 1D array instead:
I'm honestly stumped and I'm not sure where to go from here. The Kaggle exercise I learnt this from went smoothly, but for some reason things never do when I try my hand at things myself.
The fix
data_enc = oh.fit_transform(data[oh_cols])
This is much better than the apply approach anyway: the fitted oh object now holds useful information for inspecting the results (such as the categories it learned), and you can later call oh.transform on your test data.
Explaining the errors
Your data is in a pandas DataFrame object. The pandas function apply calls oh.fit_transform once per column, so each call receives a 1D Series, but OneHotEncoder expects a 2D input.
Using .values or np.array() casts your dataframe to a numpy array, but numpy has no apply method.
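Here is a small sketch of the fix on toy data. The column names and values are invented for illustration, and handle_unknown="ignore" is an optional choice of mine so that unseen categories in the test data do not raise:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data standing in for data[oh_cols]
train = pd.DataFrame({
    "Content Rating": ["Everyone", "Mature 17+", "Everyone"],
    "Type": ["Free", "Paid", "Free"],
})
test = pd.DataFrame({"Content Rating": ["Mature 17+"], "Type": ["Free"]})

oh = OneHotEncoder(handle_unknown="ignore")

# Pass the whole 2D DataFrame at once instead of applying per column
train_enc = oh.fit_transform(train)

# Reuse the fitted encoder on new data
test_enc = oh.transform(test)

print(oh.categories_)
print(test_enc.toarray())
```

The result is a sparse matrix by default; call .toarray() on it if you need a dense array and it is small enough.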
I have two lists Y_train and Y_test. At the moment they hold categorical data. Each element is either Blue or Green. They are going to be the targets for a Random Forest classifier. I need them encoded as 1.0s and 0.0s.
Here is a print(Y_train) to show you what the data frame looks like. The random numbers down the side are because the data has been shuffled. (Y_test is the same, just smaller):
183 Blue
126 Blue
1 Blue
409 Blue
575 Green
...
396 Blue
192 Blue
578 Green
838 Green
222 Blue
Name: Colour, Length: 896, dtype: object
To encode this I was going to simply loop over them and change each element to their encoded values:
for i in range(len(Y_train)):
    if Y_train[i] == 'Blue':
        Y_train[i] = 0.0
    else:
        Y_train[i] = 1.0
However, when I do this, I get the following:
Traceback (most recent call last):
File "G:\Work\Colours.py", line 90, in <module>
Main()
File "G:\Work\Colours.py", line 34, in Main
RandForest(X_train, Y_train, X_test, Y_test)
File "G:\Work\Colours.py.py", line 77, in RandForest
if Y_train[i] == 'Blue':
File "C:\Users\Me\AppData\Roaming\Python\Python37\site-packages\pandas\core\series.py", line 1068, in __getitem__
result = self.index.get_value(self, key)
File "C:\Users\Me\AppData\Roaming\Python\Python37\site-packages\pandas\core\indexes\base.py", line 4730, in get_value
return self._engine.get_value(s, k, tz=getattr(series.dtype, "tz", None))
File "pandas\_libs\index.pyx", line 80, in pandas._libs.index.IndexEngine.get_value
File "pandas\_libs\index.pyx", line 88, in pandas._libs.index.IndexEngine.get_value
File "pandas\_libs\index.pyx", line 131, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\hashtable_class_helper.pxi", line 992, in pandas._libs.hashtable.Int64HashTable.get_item
File "pandas\_libs\hashtable_class_helper.pxi", line 998, in pandas._libs.hashtable.Int64HashTable.get_item
KeyError: 6
The weird thing is that it produces this error at different times. I've used flags and prints to see how far it gets. Sometimes it will get quite a few iterations into the loop, and then other times it will only do one or two iterations before breaking.
I'm assuming I just don't quite understand how you're supposed to iterate over data frames properly. If someone with more experience with this stuff could help me out that would be great.
Try:
Y_train[Y_train == 'Blue']=0.0
Y_train[Y_train == 'Green']=1.0
That should solve your issues.
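As for why the loop fails: on a Series with an integer index, Y_train[i] looks up the index label i, not the i-th position. After shuffling and splitting, many labels are missing from Y_train, so the loop breaks at the first i that is not a label, which depends on the shuffle. A minimal sketch with made-up data, also showing a map-based encoding that returns a new Series:

```python
import pandas as pd

# Toy stand-in for a shuffled Y_train: the index labels are not 0, 1, 2
Y_train = pd.Series(["Blue", "Green", "Blue"], index=[183, 575, 1], name="Colour")

# Label-based lookup fails because there is no label 0
try:
    Y_train.loc[0]
    label_zero_missing = False
except KeyError:
    label_zero_missing = True

# Position-based lookup works regardless of the index labels
first_value = Y_train.iloc[0]

# Encode without a loop, leaving the original Series untouched
encoded = Y_train.map({"Blue": 0.0, "Green": 1.0})

print(label_zero_missing, first_value)
print(encoded)
```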
If you are using your own label-encoding method, it is better to create a separate encoded column than to modify the original column. You can then assign the encoded column to your dataframe. As an example for your scenario:
encoded = np.ones((Y_train.shape[0], 1))
for i in range(Y_train.shape[0]):
    # .iloc looks up by position; Y_train[i] looks up by index label and fails on a shuffled index
    if Y_train.iloc[i] == 'Blue':
        encoded[i] = 0
Note that this only works if you have two categories.
For multiple categories, you can use sklearn or pandas methods.
For multiple categories
Another approach is to use pandas cat.codes. You can convert a pandas Series to the category dtype and get the category codes.
Y_train = pd.Series(Y_train)
encoded = Y_train.astype("category").cat.codes
You can use sklearn's LabelEncoder to encode categorical data as well.
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
encoded = le.fit_transform(Y_train)
If you have more labels than in your current example (Blue and Green), sklearn provides a label encoder that lets you do this very easily:
from sklearn import preprocessing
label_encoder = preprocessing.LabelEncoder()
# Transforms the 'column' in your dataframe df
df['column'] = label_encoder.fit_transform(df['column'])
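A short sketch of what the encoder learns and how to go back to the original labels. The three-colour list is made up for illustration:

```python
from sklearn.preprocessing import LabelEncoder

colours = ["Blue", "Green", "Blue", "Red"]  # hypothetical labels

le = LabelEncoder()
encoded = le.fit_transform(colours)

# classes_ holds the sorted unique labels; the code is the position in it
print(list(le.classes_))
print(encoded)

# inverse_transform maps the integer codes back to the original labels
decoded = le.inverse_transform(encoded)
print(list(decoded))
```

Keeping the fitted encoder around lets you apply the same mapping to Y_test with le.transform.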
I'm trying to process a numpy array with 71,000 rows of 200 columns of floats, and the two scikit-learn models I'm trying both give different errors when I exceed 5853 rows. I tried removing the problematic row, but it continues to fail. Can scikit-learn not handle this much data, or is it something else? X is a numpy array built from a list of lists.
KNN:
nbrs = NearestNeighbors(n_neighbors=2, algorithm='ball_tree').fit(X)
Error:
File "knn.py", line 48, in <module>
nbrs = NearestNeighbors(n_neighbors=2, algorithm='ball_tree').fit(X)
File "/usr/local/lib/python2.7/dist-packages/sklearn/neighbors/base.py", line 642, in fit
return self._fit(X)
File "/usr/local/lib/python2.7/dist-packages/sklearn/neighbors/base.py", line 180, in _fit
raise ValueError("data type not understood")
ValueError: data type not understood
K-Means:
kmeans_model = KMeans(n_clusters=2, random_state=1).fit(X)
Error:
Traceback (most recent call last):
File "knn.py", line 48, in <module>
kmeans_model = KMeans(n_clusters=2, random_state=1).fit(X)
File "/usr/local/lib/python2.7/dist-packages/sklearn/cluster/k_means_.py", line 702, in fit
X = self._check_fit_data(X)
File "/usr/local/lib/python2.7/dist-packages/sklearn/cluster/k_means_.py", line 668, in _check_fit_data
X = atleast2d_or_csr(X, dtype=np.float64)
File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 134, in atleast2d_or_csr
"tocsr", force_all_finite)
File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 111, in _atleast2d_or_sparse
force_all_finite=force_all_finite)
File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 91, in array2d
X_2d = np.asarray(np.atleast_2d(X), dtype=dtype, order=order)
File "/usr/local/lib/python2.7/dist-packages/numpy/core/numeric.py", line 235, in asarray
return array(a, dtype, copy=False, order=order)
ValueError: setting an array element with a sequence.
Please check the dtype of your matrix X, e.g. by typing X.dtype. If it is object or dtype('O'), then write the lengths of the lines of X into an array:
lengths = [len(line) for line in X]
Then take a look to see whether all lines have the same length, by invoking
np.unique(lengths)
If there is more than one number in the output, then your line lengths are different, e.g. from line 5853 on, but possibly not all the time.
Numpy data arrays are only useful if all lines have the same length (they continue to work if not, but don't do what you expect). You should check to see what is causing this, correct it, and then return to knn.
Here is an example of what happens if line lengths are not the same:
import numpy as np
rng = np.random.RandomState(42)
X = rng.randn(100, 20)
# now remove one element from the 56th line
X = list(X)
X[55] = X[55][:-1]
# turn it back into an ndarray (recent NumPy versions require dtype=object for ragged input)
X = np.array(X, dtype=object)
# check the dtype
print(X.dtype)  # prints object, i.e. dtype('O')
from sklearn.neighbors import NearestNeighbors
nbrs = NearestNeighbors()
nbrs.fit(X) # raises your first error
from sklearn.cluster import KMeans
kmeans = KMeans()
kmeans.fit(X) # raises your second error
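To locate the offending lines rather than only detect them, something along these lines may help. It is a sketch on a tiny made-up list of lists, and it assumes that the most common line length is the correct one:

```python
import numpy as np

# Hypothetical ragged data: the last line is one element short
rows = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0]]

lengths = np.array([len(r) for r in rows])

# Assume the most frequent length is the intended number of columns
expected = np.bincount(lengths).argmax()

# Positions of the lines whose length differs
bad_rows = np.flatnonzero(lengths != expected)
print(bad_rows)

# Keep only well-formed lines; the result is a regular float64 array
clean = np.array([r for r in rows if len(r) == expected], dtype=np.float64)
print(clean.shape, clean.dtype)
```

Dropping the bad lines is only one option; depending on the cause, fixing the parsing step that produced them may be the better correction.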