I'm trying to process a numpy array with 71,000 rows of 200 columns of floats and the two sci-kit learn models I'm trying both give different errors when I exceed 5853 rows. I tried removing the problematic row, but it continues to fail. Can sci-kit learn not handle this much data, or is it something else? The X is numpy array of a list of lists.
KNN:
nbrs = NearestNeighbors(n_neighbors=2, algorithm='ball_tree').fit(X)
Error:
File "knn.py", line 48, in <module>
nbrs = NearestNeighbors(n_neighbors=2, algorithm='ball_tree').fit(X)
File "/usr/local/lib/python2.7/dist-packages/sklearn/neighbors/base.py", line 642, in fit
return self._fit(X)
File "/usr/local/lib/python2.7/dist-packages/sklearn/neighbors/base.py", line 180, in _fit
raise ValueError("data type not understood")
ValueError: data type not understood
K-Means:
kmeans_model = KMeans(n_clusters=2, random_state=1).fit(X)
Error:
Traceback (most recent call last):
File "knn.py", line 48, in <module>
kmeans_model = KMeans(n_clusters=2, random_state=1).fit(X)
File "/usr/local/lib/python2.7/dist-packages/sklearn/cluster/k_means_.py", line 702, in fit
X = self._check_fit_data(X)
File "/usr/local/lib/python2.7/dist-packages/sklearn/cluster/k_means_.py", line 668, in _check_fit_data
X = atleast2d_or_csr(X, dtype=np.float64)
File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 134, in atleast2d_or_csr
"tocsr", force_all_finite)
File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 111, in _atleast2d_or_sparse
force_all_finite=force_all_finite)
File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 91, in array2d
X_2d = np.asarray(np.atleast_2d(X), dtype=dtype, order=order)
File "/usr/local/lib/python2.7/dist-packages/numpy/core/numeric.py", line 235, in asarray
return array(a, dtype, copy=False, order=order)
ValueError: setting an array element with a sequence.
Please check the dtype of your matrix X, e.g. by typing X.dtype. If it is object or dtype('O'), then write the lengths of the lines of X into an array:
lengths = [len(line) for line in X]
Then take a look to see whether all lines have the same length, by invoking
np.unique(lengths)
If there is more than one number in the output, then your line lengths are different, e.g. from line 5853 on, but possibly not all the time.
Numpy data arrays are only useful if all lines have the same length (they continue to work if not, but don't do what you expect.). You should check to see what is causing this, correct it, and then return to knn.
Here is an example of what happens if line lengths are not the same:
import numpy as np
rng = np.random.RandomState(42)
X = rng.randn(100, 20)
# now remove one element from the 56th line
X = list(X)
X[55] = X[55][:-1]
# turn it back into an ndarray
X = np.array(X)
# check the dtype
print X.dtype # returns dtype('O')
from sklearn.neighbors import NearestNeighbors
nbrs = NearestNeighbors()
nbrs.fit(X) # raises your first error
from sklearn.cluster import KMeans
kmeans = KMeans()
kmeans.fit(X) # raises your second error
Related
For my Deep Learning preprocessing, I need to get the terms of frequency of each label.
Each label has already be cleaned (stopwords, punctuation, etc.)
So I used TfidfVectorizer to get the terms of frequency of each label.
The fit_transform function of TfidfVectorizer returns scipy.sparse.csr_matrix.
And next, I need to convert it to a numpy array with toarray() method (to concatenate with my others features).
Code to reproduce:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
def get_preprocessed_labels_count(data):
"""This function get the preprocessed labels.
Args:
data (pd.DataFrame): dataframe with the preprocessed data
Returns:
pd.DataFrame: dataframe with the preprocessed labels
"""
labels = data["Labels"].values
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
labels_count_np = vectorizer.fit_transform(labels)
labels_count_np = labels_count_np.toarray()
return labels_count_np
data = pd.DataFrame({"Labels": ["this is a test", "of course now it is working",
"but when there are lot of data, there is memory error"]})
print(get_preprocessed_labels_count(data))
The problem is that toarray() throws an error when converting too much data.
Traceback (most recent call last):
File "main.py", line 114, in main
train_and_test(config)
File "/storage/simple/projects/train.py", line 676, in train_and_test
data = get_processed_data(config)
File "/storage/simple/projects/train.py", line 60, in get_processed_data
data["Labels_Count"] = get_preprocessed_labels_count(data)
File "/storage/simple/projects/process_data.py", line 175, in get_preprocessed_labels_count
labels_count_np = labels_count_np.toarray().tolist()
File "/home/juana/.local/lib/python3.8/site-packages/scipy/sparse/_compressed.py", line 1051, in toarray
out = self._process_toarray_args(order, out)
File "/home/juana/.local/lib/python3.8/site-packages/scipy/sparse/_base.py", line 1288, in _process_toarray_args
return np.zeros(self.shape, dtype=self.dtype, order=order)
numpy.core._exceptions._ArrayMemoryError: Unable to allocate N GiB for an array with shape (x, y) and data type float64
For example, with x = 70898 and y = 352302, I have N = 186. GiB of memory (error).
I tried methods to reduce the memory consumption, but it did not work (reducing size of array with asformat(), computing in calc server with 128GB RAM, etc.).
Does it exist a way maybe to keep csr_matrix and concatenate with numpy without converting to numpy array?
Is there a better way to convert ?
I hope someone can help me.
I'm trying to use the PGMPY package for python to learn the parameters of a bayesian network. If I understand expectation maximization correctly, it should be able to deal with missing values. I am currently experimenting with a 3 variable BN, where the first 500 datapoints have a missing value. There are no latent variables. Although the description in pgmpy suggests that it should work with missing values, I get an error. This error only occurs when calling the function with datapoints that have missing values. Am I doing something wrong? Or should I look into another package for EM with missing values?
#import
import numpy as np
import pandas as pd
from pgmpy.estimators import BicScore, ExpectationMaximization
from pgmpy.models import BayesianNetwork
from pgmpy.estimators import HillClimbSearch
# Read data that does not contain any missing values
data = pd.read_csv("asia10K.csv")
data = pd.DataFrame(data, columns=["Smoker", "LungCancer", "X-ray"])
test_data = data[:2000]
new_data = data[2000:]
# Learn structure of initial model from data
bic = BicScore(test_data)
hc = HillClimbSearch(test_data)
model = hc.estimate(scoring_method=bic)
# create some missing values
new_data["Smoker"][:500] = np.NaN
# learn parameterization of BN
bn = BayesianNetwork(model)
bn.fit(new_data, estimator=ExpectationMaximization, complete_samples_only=False)
The error I get is an indexing error:
File "main.py", line 100, in <module>
bn.fit(new_data, estimator=ExpectationMaximization, complete_samples_only=False)
File "C:\Python38\lib\site-packages\pgmpy\models\BayesianNetwork.py", line 585, in fit
cpds_list = _estimator.get_parameters(n_jobs=n_jobs, **kwargs)
File "C:\Python38\lib\site-packages\pgmpy\estimators\EM.py", line 213, in get_parameters
weighted_data = self._compute_weights(latent_card)
File "C:\Python38\lib\site-packages\pgmpy\estimators\EM.py", line 100, in _compute_weights
weights = df.apply(lambda t: self._get_likelihood(dict(t)), axis=1)
File "C:\Python38\lib\site-packages\pandas\core\frame.py", line 8833, in apply
return op.apply().__finalize__(self, method="apply")
File "C:\Python38\lib\site-packages\pandas\core\apply.py", line 727, in apply
return self.apply_standard()
File "C:\Python38\lib\site-packages\pandas\core\apply.py", line 851, in apply_standard
results, res_index = self.apply_series_generator()
File "C:\Python38\lib\site-packages\pandas\core\apply.py", line 867, in apply_series_generator
results[i] = self.f(v)
File "C:\Python38\lib\site-packages\pgmpy\estimators\EM.py", line 100, in <lambda>
weights = df.apply(lambda t: self._get_likelihood(dict(t)), axis=1)
File "C:\Python38\lib\site-packages\pgmpy\estimators\EM.py", line 76, in _get_likelihood
likelihood *= cpd.get_value(
File "C:\Python38\lib\site-packages\pgmpy\factors\discrete\DiscreteFactor.py", line 195, in get_value
return self.values[tuple(index)]
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
Thanks!
Since there is still no answer to your specific question, let me propose a solution with another module:
#import
import pandas as pd
import numpy as np
import pyAgrum as gum
# Read data that does not contain any missing values
data = pd.read_csv("asia10K.csv")
# not exactly the same names
data = pd.DataFrame(data, columns=["smoking", "lung_cancer", "positive_XraY"])
test_data = data[:2000]
new_data = data[2000:].copy()
# Learn structure of initial model from data
learner=gum.BNLearner(test_data)
learner.useScoreBIC()
learner.useGreedyHillClimbing()
model=learner.learnBN()
# create some missing values
new_data["smoking"][:500] = "?" # instead of NaN
# learn parameterization of BN
bn = gum.BayesNet(model)
learner2=gum.BNLearner(new_data,model)
learner2.useEM(1e-10)
learner2.fitParameters(bn)
In a notebook :
Data source can be found here.
Hello all,
I've hit a stumbling block in some code I'm writing because the fit_transform method continuously fails. It throws this error:
Traceback (most recent call last):
File "/home/user/Datasets/CSVs/Working/Playstore/untitled0.py", line 18, in <module>
data = data[oh_cols].apply(oh.fit_transform)
File "/usr/lib/python3.8/site-packages/pandas/core/frame.py", line 7547, in apply
return op.get_result()
File "/usr/lib/python3.8/site-packages/pandas/core/apply.py", line 180, in get_result
return self.apply_standard()
File "/usr/lib/python3.8/site-packages/pandas/core/apply.py", line 255, in apply_standard
results, res_index = self.apply_series_generator()
File "/usr/lib/python3.8/site-packages/pandas/core/apply.py", line 284, in apply_series_generator
results[i] = self.f(v)
File "/usr/lib/python3.8/site-packages/sklearn/preprocessing/_encoders.py", line 410, in fit_transform
return super().fit_transform(X, y)
File "/usr/lib/python3.8/site-packages/sklearn/base.py", line 690, in fit_transform
return self.fit(X, **fit_params).transform(X)
File "/usr/lib/python3.8/site-packages/sklearn/preprocessing/_encoders.py", line 385, in fit
self._fit(X, handle_unknown=self.handle_unknown)
File "/usr/lib/python3.8/site-packages/sklearn/preprocessing/_encoders.py", line 74, in _fit
X_list, n_samples, n_features = self._check_X(X)
File "/usr/lib/python3.8/site-packages/sklearn/preprocessing/_encoders.py", line 43, in _check_X
X_temp = check_array(X, dtype=None)
File "/usr/lib/python3.8/site-packages/sklearn/utils/validation.py", line 73, in inner_f
return f(**kwargs)
File "/usr/lib/python3.8/site-packages/sklearn/utils/validation.py", line 620, in check_array
raise ValueError(
ValueError: Expected 2D array, got 1D array instead:
array=['Everyone' 'Everyone' 'Everyone' ... 'Everyone' 'Mature 17+' 'Everyone'].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
To put short:
ValueError: Expected 2D array, got 1D array instead:
I've done some searching on this online and arrived at a few potential solutions, but they didn't seem to work.
Here's my code:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from category_encoders import CatBoostEncoder,CountEncoder,TargetEncoder
data = pd.read_csv("/home/user/Datasets/CSVs/Working/Playstore/data.csv")
oh = OneHotEncoder()
cb = CatBoostEncoder()
ce = CountEncoder()
te = TargetEncoder()
obj = [i for i in data if data[i].dtypes=="object"]
unique = dict(zip(list(obj),[len(data[i].unique()) for i in obj]))
oh_cols = [i for i in unique if unique[i] < 100]
te_cols = [i for i in unique if unique[i] > 100]
data = data[oh_cols].apply(oh.fit_transform)
It throws the aforementioned error. A solution I saw advised me to use .values when transforming the data and I tried the following:
data = data[oh_cols].values.apply(oh.fit_transform)
data = data[oh_cols].apply(oh.fit_transform).values
encoding = np.array(data[oh_cols])
encoding.apply(oh.fit_transform)
The first and the third threw the same error which is below,:
AttributeError: 'numpy.ndarray' object has no attribute 'apply'
While the second threw the first error I mentioned again:
ValueError: Expected 2D array, got 1D array instead:
I'm honestly stumped and I'm not sure where to go from here. The Kaggle exercise I learnt this from went smoothly, but for some reason things never do when I try my hand at things myself.
The fix
data_enc = oh.fit_transform(data[oh_cols])
This is much better than the apply approach anyway, because now the object oh has lots of useful information in it when you want to inspect the results, you can later oh.transform your test data, etc.
Explaining the errors
Your data is in a pandas DataFrame object. The pandas function apply is trying to apply oh.fit_transform to each column, but OneHotEncoder expects a 2D input.
Using .values or np.array() casts your dataframe to a numpy array, but numpy has no apply method.
During a NLP process, I transform a corpus of texts using TF-IDF which yields a scipy.sparse.csr.csr_matrix.
I then split this data into train and test corpus and resample my train corpus in order to tackle a class imbalance problem.
The issue I'm facing is that when I use the resampled index (from the label which is of type pandas.Series) to resample the sparse matrix like this:
tfs[Ytr_resample.index]
It takes a lot of time, and outputs the error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-29-dd1413907d77> in <module>()
----> 1 tfs[Ytr_cat_resample.index]
/usr/local/lib/python3.5/dist-packages/scipy/sparse/csr.py in __getitem__(self, key)
348 csr_sample_values(self.shape[0], self.shape[1],
349 self.indptr, self.indices, self.data,
--> 350 num_samples, row.ravel(), col.ravel(), val)
351 if row.ndim == 1:
352 # row and col are 1d
ValueError: could not convert integer scalar
Following this thread I checked that the biggest element in the index wouldn't be bigger than the number of rows in my sparse matrix.
The problem seems to come from the fact that the index is coded in np.int64 and not in np.int32. Indeed the following works:
Xtr_resample = tfs[[np.int32(ind) for ind in Ytr_resample.index]]
Therefore I have two questions:
Is the error actually coming from this conversion int32 to int64?
Is there a more pythonic way to convert the index type? (Ytr_resample.index.astype(np.int32) does not seem to change the type for some reason)
EDIT:
Ytr_resample.index is of type pandas.core.indexes.numeric.Int64Index:
Int64Index([1484, 753, 1587, 1494, 357, 1484, 84, 823, 424, 424,
...
1558, 1619, 1317, 1635, 537, 1206, 1152, 1635, 1206, 131],
dtype='int64', length=4840)
I created Ytr_resample by resampling Ytr (which is pandas.Series) such that every category present in Ytr has the same number of elements (by oversampling):
n_samples = Ytr.value_counts(dropna = False).max()
Ytr_resample = pd.DataFrame(Ytr).groupby('cat').apply(\
lambda x: x.sample(n_samples,replace = True,random_state=42)).cat
I created the below table in Google Sheets and downloaded it as a CSV file.
My code is posted below. I'm really not sure where it's failing. I tried to highlight and run the code line by line and it keeps throwing that error.
# Data Preprocessing
# Import Libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Import Dataset
dataset = pd.read_csv('Data2.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 5].values
# Replace Missing Values
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values = 'NaN', strategy = 'mean', axis = 0)
imputer = imputer.fit(X[:, 1:5 ])
X[:, 1:6] = imputer.transform(X[:, 1:5])
The error I'm getting is:
Could not convert string to float: 'Illinois'
I also have this line above my error message
array = np.array(array, dtype=dtype, order=order, copy=copy)
It seems like my code is not able to read my GPA column which contains floats. Maybe I didn't create that column right and have to specify that they're floats?
*** I'm updating with the full error message:
[15]: runfile('/Users/jim/Desktop/Machine Learning Class/Part 1/Machine Learning A-Z Template Folder/Part 1 - Data Preprocessing/data_preprocessing_template2.py', wdir='/Users/jim/Desktop/Machine Learning Class/Part 1/Machine Learning A-Z Template Folder/Part 1 - Data Preprocessing')
Traceback (most recent call last):
File "<ipython-input-15-5f895cf9ba62>", line 1, in <module>
runfile('/Users/jim/Desktop/Machine Learning Class/Part 1/Machine Learning A-Z Template Folder/Part 1 - Data Preprocessing/data_preprocessing_template2.py', wdir='/Users/jim/Desktop/Machine Learning Class/Part 1/Machine Learning A-Z Template Folder/Part 1 - Data Preprocessing')
File "/Users/jim/anaconda3/lib/python3.6/site-packages/spyder/utils/site/sitecustomize.py", line 710, in runfile
execfile(filename, namespace)
File "/Users/jim/anaconda3/lib/python3.6/site-packages/spyder/utils/site/sitecustomize.py", line 101, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "/Users/jim/Desktop/Machine Learning Class/Part 1/Machine Learning A-Z Template Folder/Part 1 - Data Preprocessing/data_preprocessing_template2.py", line 16, in <module>
imputer = imputer.fit(X[:, 1:5 ])
File "/Users/jim/anaconda3/lib/python3.6/site-packages/sklearn/preprocessing/imputation.py", line 155, in fit
force_all_finite=False)
File "/Users/jim/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py", line 433, in check_array
array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: could not convert string to float: 'Illinois'
Actually the full error you are getting is this (which would help tremendously if you pasted it in full):
Traceback (most recent call last):
File "<ipython-input-7-6a92ceaf227a>", line 8, in <module>
imputer = imputer.fit(X[:, 1:5 ])
File "C:\Users\Fatih\Anaconda2\lib\site-packages\sklearn\preprocessing\imputation.py", line 155, in fit
force_all_finite=False)
File "C:\Users\Fatih\Anaconda2\lib\site-packages\sklearn\utils\validation.py", line 433, in check_array
array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: could not convert string to float: Illinois
which, if you look carefully, points out where it is failing:
imputer = imputer.fit(X[:, 1:5 ])
which is due to your effort in taking mean of a categorical variable, which, doesn't make sense, and
which is already asked and answered in this StackOverflow thread.
Change the line:
dataset = pd.read_csv('Data2.csv')
by:
dataset = pd.read_csv('Data2.csv', delimiter=";")