I am new to SageMaker and not sure how to classify text input in AWS SageMaker.
Suppose I have a DataFrame with two text fields, 'Ticket' and 'Category'. I want to split it into training and test sets and upload it for a SageMaker training job.
X_train, X_test, y_train, y_test = model_selection.train_test_split(fewRecords['Ticket'],fewRecords['Category'])
Now I want to perform TF-IDF feature extraction to convert the text to numeric values, so I do the following:
tfidf_vect = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', max_features=5000)
tfidf_vect.fit(fewRecords['Category'])
xtrain_tfidf = tfidf_vect.transform(X_train)
xvalid_tfidf = tfidf_vect.transform(X_test)
Then I want to prepare the data for upload to SageMaker so I can perform the next operations:
buf = io.BytesIO()
smac.write_numpy_to_dense_tensor(buf, xtrain_tfidf, y_train)
buf.seek(0)
but I am getting this error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-36-8055e6cdbf34> in <module>()
1 buf = io.BytesIO()
----> 2 smac.write_numpy_to_dense_tensor(buf, xtrain_tfidf, y_train)
3 buf.seek(0)
~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/amazon/common.py in write_numpy_to_dense_tensor(file, array, labels)
98 raise ValueError("Label shape {} not compatible with array shape {}".format(
99 labels.shape, array.shape))
--> 100 resolved_label_type = _resolve_type(labels.dtype)
101 resolved_type = _resolve_type(array.dtype)
102
~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/amazon/common.py in _resolve_type(dtype)
205 elif dtype == np.dtype('float32'):
206 return 'Float32'
--> 207 raise ValueError('Unsupported dtype {} on array'.format(dtype))
ValueError: Unsupported dtype object on array
Apart from this exception, I am not sure whether this is the right approach, since TfidfVectorizer converts the series to a matrix.
The code predicts fine on my local machine, but I am not sure how to do the same in SageMaker. All the examples mentioned there are too lengthy and not aimed at someone who has only just reached scikit-learn.
The output of TfidfVectorizer is a scipy sparse matrix, not a simple numpy array.
So either use a different function like:
write_spmatrix_to_sparse_tensor
"""Writes a scipy sparse matrix to a sparse tensor"""
See this issue for more details.
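For example, a minimal sketch of the sparse path, assuming the text labels have first been encoded to a numeric vector (y_train_encoded below is that hypothetical encoded vector, e.g. produced with sklearn's LabelEncoder):

import io
import sagemaker.amazon.common as smac

buf = io.BytesIO()
# xtrain_tfidf is the scipy sparse matrix returned by tfidf_vect.transform();
# y_train_encoded is assumed to be a numeric (e.g. float32) label vector.
smac.write_spmatrix_to_sparse_tensor(buf, xtrain_tfidf, y_train_encoded)
buf.seek(0)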
Or first convert the output of TfidfVectorizer to a dense numpy array and then use your code from above:
xtrain_tfidf = tfidf_vect.transform(X_train).toarray()
buf = io.BytesIO()
smac.write_numpy_to_dense_tensor(buf, xtrain_tfidf, y_train)
...
...
Related
I am experimenting with a sentiment analysis case and trying to run a random forest classifier on the following data:
|Topic             |value|label|
|------------------|-----|-----|
|Apples are great  |-0.99|0    |
|Balloon is red    |-0.98|1    |
|cars are running  |-0.93|0    |
|dear diary        |0.8  |1    |
|elephant is huge  |0.91 |1    |
|facebook is great |0.97 |0    |
After splitting it into train and test sets with sklearn,
I do the following to the Topic column so the count vectorizer can work on it:
x = train.iloc[:,0:2]
# remove everything except letters (strip punctuation and digits)
x.replace("[^a-zA-Z]"," ",regex=True, inplace=True)
#convert to lower case
x = x.apply(lambda a: a.astype(str).str.lower())
x.head(2)
After that I apply CountVectorizer to the Topic column, combine it with the value column, and fit a random forest classifier.
## Import library to check accuracy
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
## implement BAG OF WORDS
countvector=CountVectorizer(ngram_range=(2,2))
traindataset=countvector.fit_transform(x['Topics'])
train_set = pd.concat([x['compound'], pd.DataFrame(traindataset)], axis=1)
# implement RandomForest Classifier
randomclassifier=RandomForestClassifier(n_estimators=200,criterion='entropy')
randomclassifier.fit(train_set,train['label'])
But I receive an error:
TypeError Traceback (most recent call last)
TypeError: float() argument must be a string or a number, not 'csr_matrix'
The above exception was the direct cause of the following exception:
ValueError Traceback (most recent call last)
<ipython-input-41-7a1f9b292921> in <module>()
1 # implement RandomForest Classifier
2 randomclassifier=RandomForestClassifier(n_estimators=200,criterion='entropy')
----> 3 randomclassifier.fit(train_set,train['label'])
4 frames
/usr/local/lib/python3.6/dist-packages/numpy/core/_asarray.py in asarray(a, dtype, order)
83
84 """
---> 85 return array(a, dtype, copy=False, order=order)
86
87
ValueError: setting an array element with a sequence.
My idea is:
The values come from applying vader sentiment, and I want to feed them to the random forest classifier as well, to see the impact of the vader scores on the output.
Maybe there is a way to combine the data in the value column with the generated sparse matrix traindataset?
Can anyone please tell me how to do that in this case?
The issue is concatenating another column to a sparse matrix (the output of countvector.fit_transform). For simplicity's sake, let's say your training data is:
x = pd.DataFrame({'Topics': ['Apples are great', 'Balloon is red', 'cars are running',
                             'dear diary', 'elephant is huge', 'facebook is great'],
                  'value': [-0.99, -0.98, -0.93, 0.8, 0.91, 0.97],
                  'label': [0, 1, 0, 1, 1, 0]})
You can see this gives you something weird:
countvector=CountVectorizer(ngram_range=(2,2))
traindataset=countvector.fit_transform(x['Topics'])
train_set = pd.concat([x['value'], pd.DataFrame(traindataset)], axis=1)
train_set.head(2)
value 0
0 -0.99 (0, 0)\t1\n (0, 1)\t1
1 -0.98 (0, 3)\t1\n (0, 10)\t1
It is possible to convert your sparse matrix to a dense numpy array, and then your pandas DataFrame approach will work; however, if your dataset is huge this is extremely costly. To keep it sparse, you can do:
from scipy import sparse
train_set = sparse.hstack([sparse.csr_matrix(x['value']).reshape(-1,1), traindataset])
randomclassifier=RandomForestClassifier(n_estimators=200,criterion='entropy')
randomclassifier.fit(train_set,x['label'])
Also check out the help page for scipy.sparse.
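For completeness, a minimal sketch of the prediction side under the same assumptions, where x_test is a hypothetical DataFrame with the same 'Topics' and 'value' columns; note that the fitted vectorizer's transform (not fit_transform) is used, and the numeric column is stacked the same way:

# transform the held-out topics with the already-fitted vectorizer
testdataset = countvector.transform(x_test['Topics'])
# stack the numeric column next to the sparse bag-of-words features
test_set = sparse.hstack([sparse.csr_matrix(x_test['value']).reshape(-1, 1), testdataset])
predictions = randomclassifier.predict(test_set)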
I am trying to write a numpy.ndarray as the labels for Amazon SageMaker's conversion tool, write_numpy_to_dense_tensor(). It converts a numpy array of features and labels to a RecordIO file for better use with SageMaker algorithms.
However, if I try to pass a multi-label output for the labels, I get an error stating that the labels can only be a vector (i.e. one scalar per feature row).
Is there any way of having multiple values in the label? This is useful for multidimensional regressions which can be achieved with XGBoost, Random Forests, Neural Networks, etc.
Code
import sagemaker.amazon.common as smac
print("Types: {}, {}".format(type(X_train), type(y_train)))
print("X_train shape: {}".format(X_train.shape))
print("y_train shape: {}".format(y_train.shape))
f = io.BytesIO()
smac.write_numpy_to_dense_tensor(f, X_train.astype('float32'), y_train.astype('float32'))
Output:
Types: <class 'numpy.ndarray'>, <class 'numpy.ndarray'>
X_train shape: (9919, 2684)
y_train shape: (9919, 20)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-14-fc1033b7e309> in <module>()
3 print("y_train shape: {}".format(y_train.shape))
4 f = io.BytesIO()
----> 5 smac.write_numpy_to_dense_tensor(f, X_train.astype('float32'), y_train.astype('float32'))
~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/amazon/common.py in write_numpy_to_dense_tensor(file, array, labels)
94 if labels is not None:
95 if not len(labels.shape) == 1:
---> 96 raise ValueError("Labels must be a Vector")
97 if labels.shape[0] not in array.shape:
98 raise ValueError("Label shape {} not compatible with array shape {}".format(
ValueError: Labels must be a Vector
Tom, XGBoost does not support the RecordIO format; it only supports CSV and libsvm. Also, the algorithm itself doesn't natively support multi-label output, but there are a couple of ways around it: XGBoost for multilabel classification?
Random Cut Forest does not support multiple labels either. If more than one label is provided, it picks up only the first.
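For reference, a minimal sketch of the one-estimator-per-output workaround, using the open-source xgboost package together with scikit-learn's MultiOutputRegressor (this trains in the notebook rather than through the built-in SageMaker algorithm; the parameters below are only illustrative, and X_test is a hypothetical held-out feature matrix):

from sklearn.multioutput import MultiOutputRegressor
from xgboost import XGBRegressor  # assumes the open-source xgboost package is installed

# Fits one regressor per output column of y_train (shape (9919, 20)),
# sidestepping the "Labels must be a Vector" restriction.
model = MultiOutputRegressor(XGBRegressor(n_estimators=100))
model.fit(X_train.astype('float32'), y_train.astype('float32'))
predictions = model.predict(X_test)  # shape (n_test, 20)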
I am running Python 3.5.2 on a MacBook with macOS 10.12.1 (Sierra).
While attempting to run some code for the Titanic Dataset from Kaggle, I keep getting the following error:
NotFittedError Traceback (most recent call last)
 in ()
6
7 # Make your prediction using the test set and print them.
----> 8 my_prediction = my_tree_one.predict(test_features)
9 print(my_prediction)
10
/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sklearn/tree/tree.py in predict(self, X, check_input)
429 """
430
--> 431 X = self._validate_X_predict(X, check_input)
432 proba = self.tree_.predict(X)
433 n_samples = X.shape[0]
/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sklearn/tree/tree.py in _validate_X_predict(self, X, check_input)
386 """Validate X whenever one tries to predict, apply, predict_proba"""
387 if self.tree_ is None:
--> 388 raise NotFittedError("Estimator not fitted, "
389 "call fit before exploiting the model.")
390
NotFittedError: Estimator not fitted, call fit before exploiting the model.
The offending code seems to be this:
# Impute the missing value with the median
test.Fare[152] = test.Fare.median()
# Extract the features from the test set: Pclass, Sex, Age, and Fare.
test_features = test[["Pclass", "Sex", "Age", "Fare"]].values
# Make your prediction using the test set and print them.
my_prediction = my_tree_one.predict(test_features)
print(my_prediction)
# Create a data frame with two columns: PassengerId & Survived. Survived contains your predictions
PassengerId =np.array(test["PassengerId"]).astype(int)
my_solution = pd.DataFrame(my_prediction, PassengerId, columns = ["Survived"])
print(my_solution)
# Check that your data frame has 418 entries
print(my_solution.shape)
# Write your solution to a csv file with the name my_solution.csv
my_solution.to_csv("my_solution_one.csv", index_label = ["PassengerId"])
And here is a link to the rest of the code.
Since I have already called the 'fit' function, I cannot understand this error message. Where am I going wrong? Thanks for your time.
Edit:
Turns out that the problem is inherited from the previous block of code.
# Fit your first decision tree: my_tree_one
my_tree_one = tree.DecisionTreeClassifier()
my_tree_one = my_tree_one.fit(features_one, target)
# Look at the importance and score of the included features
print(my_tree_one.feature_importances_)
print(my_tree_one.score(features_one, target))
With the line:
my_tree_one = my_tree_one.fit(features_one, target)
generating the error:
ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
The error is self-explanatory: either the features_one or the target array contains NaNs or infinite values, so the estimator fails to fit, and therefore you cannot use it for prediction later.
Check those arrays and treat the NaN values accordingly before fitting.
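A minimal sketch of that check, assuming features_one is a numeric array built from the train DataFrame as in the tutorial (the Age column is a common source of NaNs in the Titanic data):

import numpy as np

# Check whether the arrays passed to fit() contain NaN or infinite values.
print(np.isnan(features_one).any(), np.isinf(features_one).any())

# One simple treatment is to impute missing values before building the
# feature array, e.g. filling the Age column with its median:
train["Age"] = train["Age"].fillna(train["Age"].median())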
I am trying to run scikit-learn's preprocessing StandardScaler function and I receive the following error:
from sklearn import preprocessing as pre
scaler = pre.StandardScaler().fit(t_train)
t_train_scale = scaler.transform(t_train)
t_test_scale = scaler.transform(t_test)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-149-c0133b7e399b> in <module>()
4 scaler = pre.StandardScaler().fit(t_train)
5 t_train_scale = scaler.transform(t_train)
----> 6 t_test_scale = scaler.transform(t_test)
C:\Users\****\Anaconda\lib\site-packages\sklearn\preprocessing\data.pyc in transform(self, X, y, copy)
356 else:
357 if self.with_mean:
--> 358 X -= self.mean_
359 if self.with_std:
360 X /= self.std_
ValueError: operands could not be broadcast together with shapes (40000,59) (119,) (40000,59)
I understand the shapes do not match. The train and test data sets are different lengths, so how would I transform the data?
Please print the output of t_train.shape[1] and t_test.shape[1].
StandardScaler expects any two datasets to have the same number of columns. I suspect earlier pre-processing (dropping columns, adding dummy columns, etc) is the source of your problem. Whatever transformations you make to the t_train also need to be made to t_test.
The error is telling you the information that I'm asking for:
ValueError: operands could not be broadcast together with shapes (40000,59) (119,) (40000,59)
I expect you'll find that t_train.shape[1] is 119 and t_test.shape[1] is 59.
So you have 119 columns in your training dataset and 59 in your test dataset.
Did you drop any columns from the test set (or add any to the training set) prior to attempting to use StandardScaler?
What do you mean by "train and test data sets are different lengths"? How did you obtain your training data?
If your testing data have more features than your training data, then in order to reduce the dimensionality of your testing data you need to know how your training data were produced, for example whether a dimensionality reduction technique (PCA, SVD, etc.) was used. If that is the case, you have to multiply each testing vector by the same matrix that was used to reduce the dimensionality of your training data.
The time series data had time as the columns and observations in the rows. I did the following before the originally posted code:
t_train.transpose()
t_test.transpose()
Just a reminder: I had to run the cell twice before the change 'stuck', for some reason...
The t_train shape is (x, 119), whereas the t_test shape is (40000, 59).
If you want to use the same scaler object for the transformation, then your data must always have the same number of columns.
Since you fit the scaler on t_train, that is why you get the error when you try to transform t_test.
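To make the intended pattern concrete, here is a minimal sketch, assuming the transposes are reassigned (DataFrame.transpose() returns a new frame rather than modifying in place) so that t_train and t_test end up with the same columns:

from sklearn import preprocessing as pre

t_train = t_train.transpose()  # reassign: transpose() does not modify in place
t_test = t_test.transpose()

scaler = pre.StandardScaler().fit(t_train)   # fit on the training data only
t_train_scale = scaler.transform(t_train)
t_test_scale = scaler.transform(t_test)      # works once the column counts match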
I am trying to implement image processing using the neural network implementation provided by scikit-learn. I have close to 10,000 color images in JPG format; I converted those images to PNG and removed the color information, so the new images are all black-or-white images. After converting these images into vectors, the image vectors form the input to the neural network.
For each image there is a target as well, which forms the output of the neural network.
The input file contains only 0's and 1's and nothing else. The output for each image is a continuous vector with values between 0 and 1 and a length of 22, i.e. each image's output is a vector of length 22.
To start off with the processing, I began with only 100 images and their corresponding outputs and got the following error:
ValueError: Array contains NaN or infinity
I would also like to point out that the first iteration was completed here and I encountered this error during the second iteration.
To try something different, I trimmed my input and output down to 10 images each. Using the same piece of code (coming up shortly), I was able to complete 7 iterations (I had set the number of iterations to 20) and then received the same error.
I then changed the number of iterations to 5, just to check if it works. After this change, I got the following error:
ValueError: bad input shape (10, 22)
I also tried to use np.ravel() on my input and output, but that gave me the NaN or infinity error again.
Here is the code I am using for the whole process:
import numpy as np
import csv
import matplotlib.pyplot as plt
from scipy.ndimage import convolve
from sklearn import linear_model, datasets, metrics
from sklearn.cross_validation import train_test_split
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline
def ReadCsv(fileName):
    in_file = open(fileName, 'rUb')
    reader = csv.reader(in_file, delimiter=',', quotechar='"')
    data = [[]]
    for row in reader:
        data.append(row)
    data.pop(0)
    return data
X_train = np.asarray(ReadCsv('100Images.csv'), 'float32')
Y_train = np.asarray(ReadCsv('100Images_Y_new.csv'), 'float32')
X_test = np.asarray(ReadCsv('ImagesForTest.csv'), 'float32')
Y_test = np.asarray(ReadCsv('ImagesForTest_Y_new.csv'), 'float32')
logistic = linear_model.LogisticRegression()
rbm = BernoulliRBM(random_state=0, verbose=True)
classifier = Pipeline(steps=[('rbm', rbm), ('logistic', logistic)])
rbm.learning_rate = 0.06
rbm.n_iter = 5
rbm.n_components = 100
logistic.C = 6000.0
classifier.fit(X_train, Y_train)
print()
print("Logistic regression using RBM features:\n%s\n" % (
metrics.classification_report(
Y_test,
classifier.predict(X_test))))
I would really appreciate any kind of help on this issue.
TIA.
Changing the learning rate (i.e. rbm.learning_rate) to a smaller value might fix this issue.
At least, this fixed the problem I had before.
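As a small illustration of where that knob lives (0.01 is only an example value, not taken from the original code):

# Lower the RBM step size directly on the estimator...
rbm.learning_rate = 0.01
# ...or through the pipeline's parameter interface:
classifier.set_params(rbm__learning_rate=0.01)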