I am experimenting with a sentiment analysis case and trying to run a random forest classifier on the following data:
|Topic             |value|label|
|------------------|-----|-----|
|Apples are great  |-0.99|0    |
|Balloon is red    |-0.98|1    |
|cars are running  |-0.93|0    |
|dear diary        |0.8  |1    |
|elephant is huge  |0.91 |1    |
|facebook is great |0.97 |0    |
After splitting it into train and test sets with sklearn, I do the following to the Topic column so that CountVectorizer can work on it:
x = train.iloc[:,0:2]
# keep only letters: replace all non-alphabetic characters with spaces
x.replace("[^a-zA-Z]"," ",regex=True, inplace=True)
#convert to lower case
x = x.apply(lambda a: a.astype(str).str.lower())
x.head(2)
After that I apply CountVectorizer to the topics column, combine the result with the value column, and fit a RandomForestClassifier.
## Import library to check accuracy
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
## implement BAG OF WORDS
countvector=CountVectorizer(ngram_range=(2,2))
traindataset=countvector.fit_transform(x['Topics'])
train_set = pd.concat([x['compound'], pd.DataFrame(traindataset)], axis=1)
# implement RandomForest Classifier
randomclassifier=RandomForestClassifier(n_estimators=200,criterion='entropy')
randomclassifier.fit(train_set,train['label'])
But I receive an error:
TypeError Traceback (most recent call last)
TypeError: float() argument must be a string or a number, not 'csr_matrix'
The above exception was the direct cause of the following exception:
ValueError Traceback (most recent call last)
<ipython-input-41-7a1f9b292921> in <module>()
1 # implement RandomForest Classifier
2 randomclassifier=RandomForestClassifier(n_estimators=200,criterion='entropy')
----> 3 randomclassifier.fit(train_set,train['label'])
4 frames
/usr/local/lib/python3.6/dist-packages/numpy/core/_asarray.py in asarray(a, dtype, order)
83
84 """
---> 85 return array(a, dtype, copy=False, order=order)
86
87
ValueError: setting an array element with a sequence.
My idea is:
The values come from applying VADER sentiment analysis, and I want to feed them into the random forest classifier as well, to see the impact of the VADER scores on the output.
Is there maybe a way to combine the data in the value column with the sparse matrix traindataset that was generated?
Can anyone please tell me how to do that in this case?
The issue is concatenating another column onto a sparse matrix (the output of countvector.fit_transform). For simplicity's sake, let's say your training data is:
x = pd.DataFrame({'Topics':['Apples are great','Balloon is red','cars are running',
'dear diary','elephant is huge','facebook is great'],
'value':[-0.99,-0.98,-0.93,0.8,0.91,0.97,],
'label':[0,1,0,1,1,0]})
You can see this gives you something weird:
countvector=CountVectorizer(ngram_range=(2,2))
traindataset=countvector.fit_transform(x['Topics'])
train_set = pd.concat([x['value'], pd.DataFrame(traindataset)], axis=1)
train_set.head(2)
value 0
0 -0.99 (0, 0)\t1\n (0, 1)\t1
1 -0.98 (0, 3)\t1\n (0, 10)\t1
It is possible to convert your sparse matrix to a dense numpy array, after which your pandas DataFrame approach will work; however, if your dataset is huge this is extremely costly. To keep it sparse, you can do:
from scipy import sparse
train_set = sparse.hstack([sparse.csr_matrix(x['value']).reshape(-1,1), traindataset])
randomclassifier=RandomForestClassifier(n_estimators=200,criterion='entropy')
randomclassifier.fit(train_set,x['label'])
Also check out the documentation for scipy.sparse.
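If you also have a test split, the same stacking has to be applied there; a minimal sketch, assuming a test frame x_test (hypothetical name) with the same 'Topics' and 'value' columns, and using transform rather than fit_transform so the bigram vocabulary stays the one learned on the training data:
from scipy import sparse
# x_test is assumed to have the same 'Topics' and 'value' columns as x
testdataset = countvector.transform(x_test['Topics'])  # reuse the fitted vocabulary
test_set = sparse.hstack([sparse.csr_matrix(x_test['value']).reshape(-1, 1), testdataset])
predictions = randomclassifier.predict(test_set)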
Related
Here is my code and the error. The unique values of the Sex column are: male, female; and of Embarked: S, C, Q, nan.
Code:
from sklearn import preprocessing
label_encoder = preprocessing.LabelEncoder()
def l_e(df):
    df['Embarked']= label_encoder.fit_transform(df['Embarked'])
    df['Sex']= label_encoder.fit_transform(df['Sex'])
train = l_e(train)
test = l_e(test)
train
Error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
/tmp/ipykernel_17/4261505711.py in <module>
3 def l_e(df):
4 df['Sex']= label_encoder.fit_transform(df['Sex'])
----> 5 train = l_e(train)
6 test = l_e(test)
7
/tmp/ipykernel_17/4261505711.py in l_e(df)
2 label_encoder = preprocessing.LabelEncoder()
3 def l_e(df):
----> 4 df['Sex']= label_encoder.fit_transform(df['Sex'])
5 train = l_e(train)
6 test = l_e(test)
TypeError: 'NoneType' object is not subscriptable
This error comes from your variable being None. Make sure that train and test actually contain a DataFrame.
Also, note that your function doesn't return anything, so you need to apply it to the two dataframes directly, since they will be modified inside of the function, in place:
l_e(train)
l_e(test)
However, it seems unlikely that you want to use the same LabelEncoder for the two features Embarked and Sex: you should have one encoder per feature. Furthermore, you are using fit_transform on both the train and the test set, which is probably also not what you want, because the encodings of the labels might then differ between train and test. Overall, I suggest you don't use a function at all, but rewrite your code like this:
from sklearn import preprocessing
label_encoder_embarked = preprocessing.LabelEncoder()
label_encoder_sex = preprocessing.LabelEncoder()
# Preprocessing
train['Embarked'] = label_encoder_embarked.fit_transform(train['Embarked'])
train['Sex'] = label_encoder_sex.fit_transform(train['Sex'])
test['Embarked'] = label_encoder_embarked.transform(test['Embarked'])
test['Sex'] = label_encoder_sex.transform(test['Sex'])
According to the sklearn documentation, you should pass an array-like of shape (n_samples,) to fit_transform(), so df['Embarked'] matches the required type. The error message, however, says that the passed argument is None.
What is the train variable that you are passing to the l_e function?
In addition, your l_e function doesn't return anything, so it implicitly returns None, which will raise another exception on the line test = l_e(test).
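If you do want to keep a helper function, here is a minimal sketch that returns the DataFrame and keeps one encoder per feature (the encode name and the fit flag are illustrative, not from the original code):
from sklearn import preprocessing

label_encoder_embarked = preprocessing.LabelEncoder()
label_encoder_sex = preprocessing.LabelEncoder()

def encode(df, fit):
    # fit=True learns the label mapping (training set), fit=False reuses it (test set)
    method = 'fit_transform' if fit else 'transform'
    df['Embarked'] = getattr(label_encoder_embarked, method)(df['Embarked'])
    df['Sex'] = getattr(label_encoder_sex, method)(df['Sex'])
    return df  # returning the frame makes train = encode(train, fit=True) safe

train = encode(train, fit=True)
test = encode(test, fit=False)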
I have a dataset of 284 features that I am trying to impute using scikit-learn; however, I get an error where the number of features changes to 283:
imputer = SimpleImputer(missing_values = np.nan, strategy = "mean")
imputer = imputer.fit(data.iloc[:,0:284])
df[:,0:284] = imputer.transform(df[:,0:284])
X = MinMaxScaler().fit_transform(df)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-150-849be5be8fcb> in <module>
1 imputer = SimpleImputer(missing_values = np.nan, strategy = "mean")
2 imputer = imputer.fit(data.iloc[:,0:284])
----> 3 df[:,0:284] = imputer.transform(df[:,0:284])
4 X = MinMaxScaler().fit_transform(df)
~\Anaconda3\envs\environment\lib\site-packages\sklearn\impute\_base.py in transform(self, X)
411 if X.shape[1] != statistics.shape[0]:
412 raise ValueError("X has %d features per sample, expected %d"
--> 413 % (X.shape[1], self.statistics_.shape[0]))
414
415 # Delete the invalid columns if strategy is not constant
ValueError: X has 283 features per sample, expected 284
I don't understand how this ends up with 283 features. I assume that during fitting it finds features that have all 0s (or all missing values) and decides to drop them, but I can't find documentation that tells me how to make sure those features are kept. I am not a programmer, so I'm not sure whether I'm missing something obvious or whether I should look into another method.
This could happen if you have a feature without any values, from https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html:
'Columns which only contained missing values at fit are discarded upon transform if strategy is not “constant”'.
You can tell if this is indeed the problem by using a high 'verbose' value when constructing the imputer:
sklearn.impute.SimpleImputer(..., verbose=100,...)
It will print something like:
UserWarning: Deleting features without observed values: [ ... ]
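Alternatively, you can check for such columns yourself before imputing; a minimal sketch, assuming data is the DataFrame the imputer was fitted on:
# columns containing nothing but NaN will be dropped by SimpleImputer(strategy="mean")
all_nan_cols = data.columns[data.isna().all()]
print("Columns with no observed values:", list(all_nan_cols))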
I was dealing with the same situation and solved it by adding this transformation before the mean-strategy SimpleImputer:
imputer = SimpleImputer(strategy = 'constant', fill_value = 0)
df_prepared_to_mean_or_anything_else = imputer.fit_transform(previous_df)
What does it do? It fills every missing value with the value specified in the fill_value parameter.
I am new to SageMaker and not sure how to classify text input in AWS SageMaker.
Suppose I have a DataFrame with two fields, 'Ticket' and 'Category', both of which are text. I want to split it into training and test sets and upload them to a SageMaker training model.
X_train, X_test, y_train, y_test = model_selection.train_test_split(fewRecords['Ticket'],fewRecords['Category'])
Since I want to perform TF-IDF feature extraction and then convert the text to numeric values, I perform the following operations:
tfidf_vect = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', max_features=5000)
tfidf_vect.fit(fewRecords['Category'])
xtrain_tfidf = tfidf_vect.transform(X_train)
xvalid_tfidf = tfidf_vect.transform(X_test)
Then, when I want to upload the data to SageMaker so I can perform the next operation, like this:
buf = io.BytesIO()
smac.write_numpy_to_dense_tensor(buf, xtrain_tfidf, y_train)
buf.seek(0)
I am getting this error
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-36-8055e6cdbf34> in <module>()
1 buf = io.BytesIO()
----> 2 smac.write_numpy_to_dense_tensor(buf, xtrain_tfidf, y_train)
3 buf.seek(0)
~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/amazon/common.py in write_numpy_to_dense_tensor(file, array, labels)
98 raise ValueError("Label shape {} not compatible with array shape {}".format(
99 labels.shape, array.shape))
--> 100 resolved_label_type = _resolve_type(labels.dtype)
101 resolved_type = _resolve_type(array.dtype)
102
~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/amazon/common.py in _resolve_type(dtype)
205 elif dtype == np.dtype('float32'):
206 return 'Float32'
--> 207 raise ValueError('Unsupported dtype {} on array'.format(dtype))
ValueError: Unsupported dtype object on array
Apart from this exception, I am not sure whether this is the right approach at all, since TfidfVectorizer converts the series into a matrix.
The code predicts fine on my local machine, but I am not sure how to do the same on SageMaker. All the examples mentioned there are too lengthy and not aimed at someone who has only just gotten as far as scikit-learn.
The output of TfidfVectorizer is a scipy sparse matrix, not a simple numpy array.
So either use a different function like:
write_spmatrix_to_sparse_tensor
"""Writes a scipy sparse matrix to a sparse tensor"""
See this issue for more details.
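A minimal sketch of that route, assuming smac is sagemaker.amazon.common as in the traceback and that y_train has already been encoded to a numeric dtype (the writer only accepts numeric label types):
import io
import numpy as np
import sagemaker.amazon.common as smac

buf = io.BytesIO()
# xtrain_tfidf stays a scipy sparse matrix; the labels are assumed to be numeric already
smac.write_spmatrix_to_sparse_tensor(buf, xtrain_tfidf, np.asarray(y_train, dtype=np.float32))
buf.seek(0)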
OR first convert the output of TfidfVectorizer to a dense numpy array and then use your above code
xtrain_tfidf = tfidf_vect.transform(X_train).toarray()
buf = io.BytesIO()
smac.write_numpy_to_dense_tensor(buf, xtrain_tfidf, y_train)
...
...
I am trying to write a numpy.ndarray as the labels for Amazon Sagemaker's conversion tool: write_numpy_to_dense_tensor(). It converts a numpy array of features and labels to a RecordIO for better use of Sagemaker algorithms.
However, if I try to pass a multilabel output for the labels, I get an error stating it can only be a vector (i.e. a scalar for every feature row).
Is there any way of having multiple values in the label? This is useful for multidimensional regressions which can be achieved with XGBoost, Random Forests, Neural Networks, etc.
Code
import sagemaker.amazon.common as smac
print("Types: {}, {}".format(type(X_train), type(y_train)))
print("X_train shape: {}".format(X_train.shape))
print("y_train shape: {}".format(y_train.shape))
f = io.BytesIO()
smac.write_numpy_to_dense_tensor(f, X_train.astype('float32'), y_train.astype('float32'))
Output:
Types: <class 'numpy.ndarray'>, <class 'numpy.ndarray'>
X_train shape: (9919, 2684)
y_train shape: (9919, 20)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-14-fc1033b7e309> in <module>()
3 print("y_train shape: {}".format(y_train.shape))
4 f = io.BytesIO()
----> 5 smac.write_numpy_to_dense_tensor(f, X_train.astype('float32'), y_train.astype('float32'))
~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/amazon/common.py in write_numpy_to_dense_tensor(file, array, labels)
94 if labels is not None:
95 if not len(labels.shape) == 1:
---> 96 raise ValueError("Labels must be a Vector")
97 if labels.shape[0] not in array.shape:
98 raise ValueError("Label shape {} not compatible with array shape {}".format(
ValueError: Labels must be a Vector
Tom, XGBoost does not support the RecordIO format; it only supports csv and libsvm. Also, the algorithm itself doesn't natively support multi-label output, but there are a couple of ways around it: see "XGBoost for multilabel classification?".
Random Cut Forest does not support multiple labels either; if more than one label is provided, it only picks up the first.
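Outside the built-in algorithms, one common workaround is to fit one model per label column; a minimal sketch with scikit-learn's MultiOutputClassifier (the base estimator is only an example, and X_train/y_train follow the shapes in the question):
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier

# fits one classifier per column of the (9919, 20) label matrix
multi = MultiOutputClassifier(RandomForestClassifier(n_estimators=100))
multi.fit(X_train, y_train)
predictions = multi.predict(X_train)  # shape (n_samples, 20)
For the multi-output regression case there is an analogous MultiOutputRegressor.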
I am trying to run scikit-learn's StandardScaler from the preprocessing module and I receive the following error:
from sklearn import preprocessing as pre
scaler = pre.StandardScaler().fit(t_train)
t_train_scale = scaler.transform(t_train)
t_test_scale = scaler.transform(t_test)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-149-c0133b7e399b> in <module>()
4 scaler = pre.StandardScaler().fit(t_train)
5 t_train_scale = scaler.transform(t_train)
----> 6 t_test_scale = scaler.transform(t_test)
C:\Users\****\Anaconda\lib\site-packages\sklearn\preprocessing\data.pyc in transform(self, X, y, copy)
356 else:
357 if self.with_mean:
--> 358 X -= self.mean_
359 if self.with_std:
360 X /= self.std_
ValueError: operands could not be broadcast together with shapes (40000,59) (119,) (40000,59)
I understand the shapes do not match. The train and test data sets have different lengths, so how should I transform the data?
Please print the output of t_train.shape[1] and t_test.shape[1].
StandardScaler expects any two datasets to have the same number of columns. I suspect earlier pre-processing (dropping columns, adding dummy columns, etc) is the source of your problem. Whatever transformations you make to the t_train also need to be made to t_test.
The error is telling you the information that I'm asking for:
ValueError: operands could not be broadcast together with shapes (40000,59) (119,) (40000,59)
I expect you'll find that t_train.shape[1] is 119 and t_test.shape[1] is 59.
So you have 119 columns in your training dataset and 59 in your test dataset.
Did you remove any columns from the test set prior to attempting to use StandardScaler?
What do you mean by "train and test data sets are different lengths"? How did you obtain your training data?
If your test data have more features than your training data, then in order to reduce the dimensionality of your test data you need to know how your training data were produced, for example with a dimensionality reduction technique (PCA, SVD, etc.). If that is the case, you have to multiply each test vector by the same matrix that was used to reduce the dimensionality of your training data.
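For example, if the training data were reduced with PCA, the same fitted object has to be applied to the test data; a minimal sketch (PCA and the component count are only illustrative, since the question does not say which technique was used):
from sklearn.decomposition import PCA

pca = PCA(n_components=59)               # fit the reducer on the training data only
t_train_reduced = pca.fit_transform(t_train)
t_test_reduced = pca.transform(t_test)   # same projection, so the column counts match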
The time series was in a format with time as the columns and data in the rows. I did the following before the originally posted code:
t_train.transpose()
t_test.transpose()
Just a reminder: I had to run the cell twice before the change 'stuck', for some reason...
t_train's shape is (x, 119), whereas t_test's shape is (40000, 59).
If you want to use the same scaler object for the transformation, then your data must always have the same number of columns.
Since you fit the scaler on t_train, that is why you get the error when you try to transform t_test.
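A minimal sketch of the fix, assuming t_train and t_test are pandas DataFrames that share column names: select the same columns for both, then fit the scaler once on the training data and reuse it:
from sklearn import preprocessing as pre

# keep only the columns the scaler will be fitted on
t_test_aligned = t_test[t_train.columns]

scaler = pre.StandardScaler().fit(t_train)        # fit on the training data only
t_train_scale = scaler.transform(t_train)
t_test_scale = scaler.transform(t_test_aligned)   # shapes now agree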