I am working on sentiment analysis of around 30,000 tweets, using Python 2.7 on Linux. In the training phase I am using nltk as a wrapper for the sklearn library to apply different classifiers such as Naive Bayes, LinearSVC, logistic regression, etc.
It works fine when the number of tweets is around 10,000, but now I receive an error for 30,000 tweets when classifying bigrams with multinomial Naive Bayes in sklearn. Here is part of the implementation code, after pre-processing and dividing into train and test sets:
import nltk
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.naive_bayes import MultinomialNB

# Build feature sets lazily from the pre-processed tweets
training_set = nltk.classify.util.apply_features(extractFeatures, trainTweets)
testing_set = nltk.classify.util.apply_features(extractFeatures, testTweets)

# Train a multinomial Naive Bayes classifier through the NLTK wrapper
MNB_classifier = SklearnClassifier(MultinomialNB())
MNB_classifier.train(training_set)
MNBAccuracy = nltk.classify.accuracy(MNB_classifier, testing_set) * 100

print "-------- MultinomialNB --------"
print "RESULT : Matches " + str(int((testSize * MNBAccuracy) / 100)) + ":" + str(testSize)
print "MNB accuracy percentage: " + str(MNBAccuracy)
print ""
Here is the error:
Traceback (most recent call last):
File "/home/sb402747/Desktop/Sentiment/sentiment140API/analysing/Classifier.py", line 83, in <module>
MNB_classifier.train(training_set)
File "/home/sb402747/.local/lib/python2.7/site-packages/nltk/classify/scikitlearn.py", line 115, in train
X = self._vectorizer.fit_transform(X)
File "/home/sb402747/.local/lib/python2.7/site-packages/sklearn/feature_extraction/dict_vectorizer.py", line 226, in fit_transform
return self._transform(X, fitting=True)
File "/home/sb402747/.local/lib/python2.7/site-packages/sklearn/feature_extraction/dict_vectorizer.py", line 176, in _transform
indptr.append(len(indices))
OverflowError: signed integer is greater than maximum
I guess the reason is that the number of indices in the array is more than the maximum allowed for it in dict_vectorizer.py. I even tried to change the type code of the indices in dict_vectorizer.py from 'i' to 'l', but that didn't solve my problem and I received this error:
Traceback (most recent call last):
File "/home/sb402747/Desktop/Sentiment/ServerBackup26-02-2016/analysing/Classifier.py", line 84, in <module>
MNB_classifier.train(training_set)
File "/home/sb402747/.local/lib/python2.7/site-packages/nltk/classify/scikitlearn.py", line 115, in train
X = self._vectorizer.fit_transform(X)
File "/home/sb402747/.local/lib/python2.7/site-packages/sklearn/feature_extraction/dict_vectorizer.py", line 226, in fit_transform
return self._transform(X, fitting=True)
File "/home/sb402747/.local/lib/python2.7/site-packages/sklearn/feature_extraction/dict_vectorizer.py", line 186, in _transform
shape=shape, dtype=dtype)
File "/rwthfs/rz/SW/UTIL.common/Python/2.7.9/x86_64/lib/python2.7/site-packages/scipy/sparse/compressed.py", line 88, in __init__
self.check_format(full_check=False)
File "/rwthfs/rz/SW/UTIL.common/Python/2.7.9/x86_64/lib/python2.7/site-packages/scipy/sparse/compressed.py", line 167, in check_format
raise ValueError("indices and data should have the same size")
ValueError: indices and data should have the same size
Then I discarded that change and reverted it back to 'i'. How can I solve this problem?
Hmm, looks like the problem is here:
File "/home/sb402747/.local/lib/python2.7/site-packages/nltk/classify/scikitlearn.py", line 115, in train
X = self._vectorizer.fit_transform(X)
nltk demands too big a matrix as a result.
Maybe you can change that somehow, for example by minimizing the number of features (words) in your text, or by requesting this result in two passes? (A sketch of the first idea is at the end of this answer.)
Also, are you trying to do this on the latest stable releases of numpy/scipy/scikit-learn?
Read this too: https://sourceforge.net/p/scikit-learn/mailman/message/31340515/
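To illustrate the feature-reduction idea, here is a minimal sketch (not the original poster's code; it assumes trainTweets is a list of (token_list, label) pairs, and the 5,000 cutoff is just an example) that keeps only the most frequent bigrams so the vectorized matrix stays small:
import nltk
from nltk.util import ngrams
from collections import Counter

# Count all bigrams in the training tweets and keep only the most frequent ones
bigram_counts = Counter()
for tokens, label in trainTweets:
    bigram_counts.update(" ".join(bg) for bg in ngrams(tokens, 2))
top_bigrams = set(bg for bg, _ in bigram_counts.most_common(5000))

def extractFeatures(tokens):
    # Emit a boolean feature only for whitelisted bigrams,
    # so the DictVectorizer matrix stays small
    return dict((bg, True)
                for bg in (" ".join(b) for b in ngrams(tokens, 2))
                if bg in top_bigrams)

training_set = nltk.classify.util.apply_features(extractFeatures, trainTweets)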
Related
I'd need to re-fit a scikit-learn pipeline using a smaller dataset, without some features that are actually not used by the model.
(The actual situation is that I'm saving it through joblib and loading it in another file, where I need to re-fit it since it contains some custom transformers I made, and adding all the features would be a pain since it's a different kind of model. However, this is not important, since the same error also happens if I re-fit the model before saving it, in the same file where I first trained it.)
This is my custom transformer:
from sklearn.base import BaseEstimator, TransformerMixin

class TransformAdoptionFeatures(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Keep only the municipality, adjacent-neighbour and port features,
        # excluding the cumulative totals
        adoption_features = X.columns
        feats_munic = [feat for feat in adoption_features if '_munic' in feat]
        feats_adj_neigh = [feat for feat in adoption_features if '_adj' in feat]
        feats_port = [feat for feat in adoption_features if '_port' in feat]
        feats_to_keep_all = feats_munic + feats_adj_neigh + feats_port
        feats_to_keep = [feat for feat in feats_to_keep_all
                         if 'tot_cumul' not in feat]
        return X[feats_to_keep]
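For context, a tiny usage sketch of this transformer on a toy dataframe (the column names here are made up for illustration):
import pandas as pd

df = pd.DataFrame({
    'pop_munic': [1.0, 2.0],
    'ports_adj': [0.5, 0.1],
    'traffic_port': [3.0, 4.0],
    'tot_cumul_adoption_pr_y_munic': [7.0, 8.0],
    'unrelated': [0, 1],
})

# Keeps pop_munic, ports_adj and traffic_port; drops the cumulative
# total and the unrelated column
print(TransformAdoptionFeatures().fit_transform(df).columns.tolist())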
And this is my pipeline:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

full_pipeline = Pipeline([
    ('transformer', TransformAdoptionFeatures()),
    ('scaler', StandardScaler()),
])

model = Pipeline([
    ("preparation", full_pipeline),
    ("regressor", ml_model),
])
Here ml_model is whichever scikit-learn machine learning model. Both full_pipeline and ml_model are already fitted when the model is saved. (In the actual model there is a ColumnTransformer intermediate step that represents the actual full_pipeline, since I need different transformers for different columns, but I copied only the important one for brevity.)
Issue: I reduced the number of features of the dataset I had already used to fit everything, removing some features that are not considered by TransformAdoptionFeatures() (they do not get into the features to keep). Then I tried to re-fit the model on the new dataset with the reduced features and I got this error:
Traceback (most recent call last):
File "C:\Users\giaco\anaconda3\envs\mesa_geo_ml\lib\site-packages\pandas\core\indexes\base.py", line 2889, in get_loc
return self._engine.get_loc(casted_key)
File "pandas\_libs\index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\index.pyx", line 97, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\hashtable_class_helper.pxi", line 1675, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas\_libs\hashtable_class_helper.pxi", line 1683, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'tot_cumul_adoption_pr_y_munic'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "C:\Users\giaco\anaconda3\envs\mesa_geo_ml\lib\site-packages\sklearn\utils\__init__.py", line 447, in _get_column_indices
col_idx = all_columns.get_loc(col)
File "C:\Users\giaco\anaconda3\envs\mesa_geo_ml\lib\site-packages\pandas\core\indexes\base.py", line 2891, in get_loc
raise KeyError(key) from err
KeyError: 'tot_cumul_adoption_pr_y_munic'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "C:\Users\giaco\sbp-abm\municipalities_abm\test.py", line 15, in <module>
modelSBP = model.SBPAdoption(initial_year=start_year)
File "C:\Users\giaco\sbp-abm\municipalities_abm\municipalities_abm\model.py", line 103, in __init__
self._upload_ml_models(ml_clsf_folder, ml_regr_folder)
File "C:\Users\giaco\sbp-abm\municipalities_abm\municipalities_abm\model.py", line 183, in _upload_ml_models
self._ml_clsf.fit(clsf_dataset.drop('adoption_in_year', axis=1),
File "C:\Users\giaco\anaconda3\envs\mesa_geo_ml\lib\site-packages\sklearn\pipeline.py", line 330, in fit
Xt = self._fit(X, y, **fit_params_steps)
File "C:\Users\giaco\anaconda3\envs\mesa_geo_ml\lib\site-packages\sklearn\pipeline.py", line 292, in _fit
X, fitted_transformer = fit_transform_one_cached(
File "C:\Users\giaco\anaconda3\envs\mesa_geo_ml\lib\site-packages\joblib\memory.py", line 352, in __call__
return self.func(*args, **kwargs)
File "C:\Users\giaco\anaconda3\envs\mesa_geo_ml\lib\site-packages\sklearn\pipeline.py", line 740, in _fit_transform_one
res = transformer.fit_transform(X, y, **fit_params)
File "C:\Users\giaco\anaconda3\envs\mesa_geo_ml\lib\site-packages\sklearn\compose\_column_transformer.py", line 529, in fit_transform
self._validate_remainder(X)
File "C:\Users\giaco\anaconda3\envs\mesa_geo_ml\lib\site-packages\sklearn\compose\_column_transformer.py", line 327, in _validate_remainder
cols.extend(_get_column_indices(X, columns))
File "C:\Users\giaco\anaconda3\envs\mesa_geo_ml\lib\site-packages\sklearn\utils\__init__.py", line 454, in _get_column_indices
raise ValueError(
ValueError: A given column is not a column of the dataframe
I do not understand what this error is due to; I thought scikit-learn did not store the names of the columns that I pass.
I found my error: it was actually in my use of the ColumnTransformer, which is also the only place where the column names enter.
My mistake was really simple: I just did not update the lists of columns that each transformation is applied to by removing the names of the excluded features.
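For reference, a minimal sketch of the fix (the ColumnTransformer layout and the reduced_df name are hypothetical, since the real pipeline is not shown): the column lists passed to the ColumnTransformer must be rebuilt from the reduced dataframe so they only name columns that still exist:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Rebuild the column list from the *reduced* dataframe instead of reusing
# the stale list that still mentions the dropped columns
adoption_cols = [c for c in reduced_df.columns
                 if '_munic' in c or '_adj' in c or '_port' in c]

preparation = ColumnTransformer([
    ('adoption', Pipeline([
        ('transformer', TransformAdoptionFeatures()),
        ('scaler', StandardScaler()),
    ]), adoption_cols),  # only columns present in reduced_df
], remainder='drop')

preparation.fit(reduced_df)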
I'm trying to implement the decision tree regressor algorithm on some training data, but when I call fit() I get an error.
(trainingData, testData) = data.randomSplit([0.7, 0.3])
vecAssembler = VectorAssembler(inputCols=["_1", "_2", "_3", "_4", "_5", "_6", "_7", "_8", "_9", "_10"], outputCol="features")
dt = DecisionTreeRegressor(featuresCol="features", labelCol="_11")
dt_model = dt.fit(trainingData)
This generates the error:
File "spark.py", line 100, in <module>
main()
File "spark.py", line 87, in main
dt_model = dt.fit(trainingData)
File "/opt/spark/python/pyspark/ml/base.py", line 132, in fit
return self._fit(dataset)
File "/opt/spark/python/pyspark/ml/wrapper.py", line 295, in _fit
java_model = self._fit_java(dataset)
File "/opt/spark/python/pyspark/ml/wrapper.py", line 292, in _fit_java
return self._java_obj.fit(dataset._jdf)
File "/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/opt/spark/python/pyspark/sql/utils.py", line 79, in deco
raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.IllegalArgumentException: u'requirement failed: Column features must be of type struct<type:tinyint,size:int,indices:array<int>,values:array<double>> but was actually struct<type:tinyint,size:int,indices:array<int>,values:array<double>>.'
But the data structures are exactly the same.
You are missing two steps: 1. the transformation part, and 2. selecting the features and the label from the transformed data. I assume the data contains only numerical data, i.e. no categorical data. I am going to write down a generic flow of training a model using pyspark.ml to help you.
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import DecisionTreeClassifier

# data processing part
vecAssembler = VectorAssembler(inputCols=['col_{}'.format(i) for i in range(1, 11)],
                               outputCol='features')

# you missed these two steps
trans_data = vecAssembler.transform(data)
final_data = trans_data.select('features', 'col_11')  # your label column name is col_11
train_data, test_data = final_data.randomSplit([0.7, 0.3])

# ml part
dt = DecisionTreeClassifier(featuresCol='features', labelCol='col_11')
dt_model = dt.fit(train_data)
dt_predictions = dt_model.transform(test_data)

# proceed with the model evaluation part after this
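To follow up on that last comment, a minimal evaluation sketch (assuming the classifier flow above; the label and prediction column names are just the ones used in that example):
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Evaluate the predictions produced by dt_model.transform(test_data)
evaluator = MulticlassClassificationEvaluator(labelCol='col_11',
                                              predictionCol='prediction',
                                              metricName='accuracy')
accuracy = evaluator.evaluate(dt_predictions)
print('Test accuracy: {}'.format(accuracy))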
I have the following basic code with the LightFM recommendation module:
import numpy as np
from scipy import sparse
from lightfm import LightFM

# Interactions
A = [0, 1, 2, 3, 4, 4]  # users
B = [0, 0, 1, 2, 2, 3]  # items
C = [1, 1, 1, 1, 1, 1]  # weights
matrix = sparse.coo_matrix((C, (A, B)), shape=(max(A) + 1, max(B) + 1))

# Create model
model = LightFM(loss='warp')

# Train model
model.fit(matrix, epochs=30)

# Predict
scores = model.predict(1, np.array([0, 1, 2, 3]))
print(scores)
This returns the following error:
C:\Program Files\Python\Python36\lib\site-packages\numpy\core\_methods.py:32: RuntimeWarning: invalid value encountered in reduce
  return umr_sum(a, axis, dtype, out, keepdims)
Traceback (most recent call last):
  File "run.py", line 15, in <module>
    model.fit(matrix, epochs=100)
  File "C:\Program Files\Python\Python36\lib\site-packages\lightfm\lightfm.py", line 476, in fit
    verbose=verbose)
  File "C:\Program Files\Python\Python36\lib\site-packages\lightfm\lightfm.py", line 580, in fit_partial
    self._check_finite()
  File "C:\Program Files\Python\Python36\lib\site-packages\lightfm\lightfm.py", line 410, in _check_finite
    raise ValueError("Not all estimated parameters are finite,"
ValueError: Not all estimated parameters are finite, your model may have diverged. Try decreasing the learning rate or normalising feature values and sample weights
Strangely enough, making some changes in the interaction data makes it work, as with:
# Interactions
A=[0,1,2,3,4,4]
B=[0,0,1,2,2,10] # notice the 10 here
C=[1,1,1,1,1,1]
Could anyone help me with that please?
#Predict
scores = model.predict(1, np.array([0,1,2,3]))
print(scores)
[-0.17697991 -0.55117112 -0.37800685 -0.57664376]
It works fine for me; maybe update your lightfm version?
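If upgrading does not help, the error message itself suggests lowering the learning rate; here is a minimal sketch using the same interactions as above (the value 0.01 is just an example):
import numpy as np
from scipy import sparse
from lightfm import LightFM

A = [0, 1, 2, 3, 4, 4]  # users
B = [0, 0, 1, 2, 2, 3]  # items
C = [1, 1, 1, 1, 1, 1]  # weights
matrix = sparse.coo_matrix((C, (A, B)), shape=(max(A) + 1, max(B) + 1))

# WARP loss with a smaller learning rate, as the error message suggests
model = LightFM(loss='warp', learning_rate=0.01)
model.fit(matrix, epochs=30)
print(model.predict(1, np.array([0, 1, 2, 3])))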
I have a trained model which I am loading using the CNTK.load_model() function. I was looking at the MNIST tutorial in the CNTK git repo as a reference for the model evaluation code. I have created a data reader (which is a MinibatchSource object) and I am trying to run model.eval(mb), where mb = minibatch_source.next_minibatch(...) (similar to this answer).
But I'm getting the following error message:
Traceback (most recent call last):
File "LID_test.py", line 162, in <module>
test_and_evaluate()
File "LID_test.py", line 159, in test_and_evaluate
predictions = model.eval(mb)
File "/home/t-asbahe/anaconda3/envs/cntk-py35/lib/python3.5/site-packages/cntk/ops/functions.py", line 228, in eval
_, output_map = self.forward(arguments, self.outputs, device=device, as_numpy=as_numpy)
File "/home/t-asbahe/anaconda3/envs/cntk-py35/lib/python3.5/site-packages/cntk/utils/swig_helper.py", line 62, in wrapper
result = f(*args, **kwds)
File "/home/t-asbahe/anaconda3/envs/cntk-py35/lib/python3.5/site-packages/cntk/ops/functions.py", line 354, in forward
None, device)
File "/home/t-asbahe/anaconda3/envs/cntk-py35/lib/python3.5/site-packages/cntk/utils/__init__.py", line 393, in sanitize_var_map
if len(arguments) < len(op_arguments):
TypeError: object of type 'Variable' has no len()
I have no input_variable named 'Variable' in my model and I don't see any reason to get this error.
P.S.: My inputs are sparse inputs (one-hots)
You have a few options:
Pass the data as a numpy array (for instance, as in the CNTK 202 tutorial), where one-hot data is passed in as a numpy array:
pred = model.eval({model.arguments[0]:[onehot]})
Read the minibatch data and pass it to the eval function
# Map the model's input variable to the reader's features stream
eval_input_map = {input: reader_eval.streams.features}
eval_data = reader_eval.next_minibatch(eval_minibatch_size,
                                       input_map=eval_input_map)
mydata = eval_data[input].value
predicted = model.eval(mydata)
When I try to train on a corpus of 40K sentences, there is no problem. But when I train on 86K sentences, I get an error like this:
ERROR:root:
Traceback (most recent call last):
File "CLC_POS_train.py", line 95, in main
train(sys.argv[10], encoding, flag_tagger, k, percent, eval_flag)
File "CLC_POS_train.py", line 49, in train
CLC_POS.process('TBL', train_data, test_data, flag_evaluate[1], flag_dump[1], 'pos_tbl.model' + postfix)
File "d:\WORKing\VCL\TEST\CongToan_POS\Source\CLC_POS.py", line 184, in process
tagger = CLC_POS.train_tbl(train_data)
File "d:\WORKing\VCL\TEST\CongToan_POS\Source\CLC_POS.py", line 71, in train_tbl
tbl_tagger = brill_trainer.BrillTaggerTrainer.train(trainer, train_data, max_rules=1000, min_score=3)
File "C:\Python34\lib\site-packages\nltk-3.1-py3.4.egg\nltk\tag\brill_trainer.py", line 274, in train
self._init_mappings(test_sents, train_sents)
File "C:\Python34\lib\site-packages\nltk-3.1-py3.4.egg\nltk\tag\brill_trainer.py", line 341, in _init_mappings
self._tag_positions[tag].append((sentnum, wordnum))
MemoryError
INFO:root:
I already used Python 3.5 on 64-bit Windows but I still get this error.
This is the code used for training:
import nltk
from nltk.tag import RegexpTagger, brill, brill_trainer

t0 = RegexpTagger(MyRegexp.create_regexp_tagger())
t1 = nltk.UnigramTagger(train_data, backoff=t0)
t2 = nltk.BigramTagger(train_data, backoff=t1)
trainer = brill_trainer.BrillTaggerTrainer(t2, brill.fntbl37())
tbl_tagger = brill_trainer.BrillTaggerTrainer.train(trainer, train_data, max_rules=1000, min_score=3)
This happened because your PC doesn't have enough RAM.
Training on a large corpus like this requires a lot of memory.
Install more RAM and then you will be able to get it done.