Text classification / text vectorization in Python

My goal is to do text classification with supervised ML algorithms. I'm at the stage where I need to encode my words so that the computer can understand them. I'm trying the TF-IDF vectorizer, but I get the error 'Series' object has no attribute 'lower'. Is there another way to prepare data for sentiment analysis, or am I on the right path and just need to work out how to vectorize the words? My data is a DataFrame with 'text' and 'sentiment' columns, and my code is below:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import BernoulliNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report
from sklearn import preprocessing
#tostr = svietimas_data['text'].astype(str).tolist()
print(svietimas_data)
tfidf = TfidfVectorizer(max_features=3000)
X = svietimas_data['text']
y = svietimas_data['sentiment']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X = tfidf.fit_transform([X])
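The error comes from tfidf.fit_transform([X]): wrapping the Series in a list makes the vectorizer treat the entire Series as a single document and call .lower() on it, hence 'Series' object has no attribute 'lower'. A minimal sketch of the usual pattern (assuming svietimas_data is a DataFrame with string 'text' and label 'sentiment' columns): split first, then fit the vectorizer on the training split only and reuse it on the test split.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

X = svietimas_data['text'].astype(str)  # pass the Series itself, not [Series]
y = svietimas_data['sentiment']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

tfidf = TfidfVectorizer(max_features=3000)
X_train_vec = tfidf.fit_transform(X_train)  # learn the vocabulary on training data only
X_test_vec = tfidf.transform(X_test)        # apply the same vocabulary to the test split

The vectorized matrices can then be passed to LinearSVC or BernoulliNB as usual.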

Related

How to get name of selected features when there are several feature selection methods in sklearn pipeline?

I want to use several feature selection methods in an sklearn pipeline, as below:
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.feature_selection import VarianceThreshold
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = Pipeline([('vt', VarianceThreshold(0.01)),
                  ('kbest', SelectKBest(chi2, k=5)),
                  ('gbc', GradientBoostingClassifier(random_state=0))])
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
I want to get the names or column indices of the selected features. The point is that the 2nd feature-selection step receives the output of the 1st step (not the original X_train). Therefore, when I use methods like get_support() or get_feature_names_out() on the 2nd step, the feature names or indices don't match the original input features.
vt = model['vt']
vt.get_feature_names_out()
vt.get_support()
kbest = model['kbest']
kbest.get_feature_names_out()
kbest.get_support()
For example, when I run vt.get_support(), I get a boolean array with 30 entries, but when I run kbest.get_support(), I get a boolean array with 14 entries. This means the column indices of the data fed into the 2nd feature-selection step were reset, so there is no match with the input to the 1st feature-selection step.
How can I solve this issue?
In case it is enough for you to get the names of the selected features, without caring about which features are selected in which step**, the following might be an easy way to go.
You can just return your input X as a DataFrame via the parameter as_frame set to True (X, y = load_breast_cancer(return_X_y=True, as_frame=True)). This lets you keep feature names as strings, which in turn allows .get_feature_names_out() to return the selected features with their original names. The same does not happen when you work with a numpy array, as arrays have no explicit column names.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.feature_selection import VarianceThreshold
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = Pipeline([('vt', VarianceThreshold(0.01)),
                  ('kbest', SelectKBest(chi2, k=5)),
                  ('gbc', GradientBoostingClassifier(random_state=0))])
model.fit(X_train, y_train)
model[:-1].get_feature_names_out()
** btw, this will also give you the original names of the features selected by the first transformer, but unfortunately not those of the second one, as the DataFrame becomes a numpy array along the way.
vt = model['vt']
vt.get_feature_names_out()
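If you do need to trace the second step's selection back to the original columns, a small workaround (a sketch, not part of the original answer) is to compose the boolean masks returned by get_support():

import numpy as np

vt_mask = model['vt'].get_support()        # length 30: mask over the original features
kbest_mask = model['kbest'].get_support()  # length 14: mask over the features vt kept

# Indices, into the original 30 columns, of the features SelectKBest chose
original_indices = np.flatnonzero(vt_mask)[kbest_mask]

This works because np.flatnonzero(vt_mask) lists the original column indices that survived VarianceThreshold, in the same order SelectKBest saw them.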

How to know from which interval of the input the features used in sktime's TimeSeriesForestClassifier are calculated

I used the sktime library's TimeSeriesForestClassifier class to perform multivariate time series classification.
The code is as follows:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sktime.classification.compose import ColumnEnsembleClassifier
from sktime.classification.interval_based import TimeSeriesForestClassifier
from sktime.datasets import load_basic_motions
from sktime.transformations.panel.compose import ColumnConcatenator
X, y = load_basic_motions(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
steps = [
    ("concatenate", ColumnConcatenator()),
    ("classify", TimeSeriesForestClassifier(n_estimators=100)),
]
clf = Pipeline(steps)
clf.fit(X_train, y_train)
clf.score(X_test, y_test)
I would like to check the value of feature_importances_, which is not the same length as the input, but an array with the same length as the number of features.
clf.steps[1][1].feature_importances_
I would like to know which part of the input each importance corresponds to. Is there any way to get information about which section of the input the TimeSeriesForestClassifier is calculating features from?
You can get the intervals (start and end index) for each tree of the ensemble from:
clf.steps[1][1].intervals_
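For example, to print the time slices used by the first tree (a hedged sketch; the exact layout of intervals_ can vary between sktime versions, but each entry is assumed to be a (start, end) index pair):

tsf = clf.steps[1][1]
for start, end in tsf.intervals_[0]:  # intervals used by the first tree in the ensemble
    print(f"features computed over time steps [{start}, {end})")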
sktime now also has an implementation of the newer Canonical Interval Forest (CIF).
When we first implemented the Time Series Forest algorithm, we ended up with two versions. The one that you're using is the recommended one, but the older version provides its own functionality for the feature importance graph (see below).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sktime.classification.compose import ColumnEnsembleClassifier
from sktime.classification.compose import ComposableTimeSeriesForestClassifier
from sktime.datasets import load_basic_motions
from sktime.transformations.panel.compose import ColumnConcatenator
X, y = load_basic_motions(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
steps = [
    ("concatenate", ColumnConcatenator()),
    ("classify", ComposableTimeSeriesForestClassifier(n_estimators=100)),
]
clf = Pipeline(steps)
clf.fit(X_train, y_train)
clf.score(X_test, y_test)
clf.steps[-1][-1].feature_importances_.rename(columns={"_slope": "slope"}).plot(xlabel="time", ylabel="feature importance")
Be aware of some subtle issues in the calculation and interpretation of the feature importances. The relevant issues are here:
https://github.com/alan-turing-institute/sktime/issues/214
https://github.com/alan-turing-institute/sktime/issues/669

sklearn classifier gets ValueError: bad input shape (3529, 12)

I have a JSON file of preprocessed data in which the text has already been converted to vectors, and I want to train on it with an SVM classifier. One column is named 'Vector'; the remaining columns hold the genre labels for each vector.
import pickle
from nltk.corpus import stopwords
import string
from nltk.stem import SnowballStemmer
import pandas as pd
from sklearn import svm
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn import metrics
stopwords=set(stopwords.words("english"))
exclude = set(string.punctuation)
snow=SnowballStemmer("english")
tvec = pickle.load(open("dataPackage/tfidf.pickle", 'rb'))
data=pd.read_json("dataPackage/finalData.json",orient = 'split')
inputLen = len(data["Vector"].iloc[0])
X = list(data["Vector"])
y = list(data.drop(["Vector"],axis = 1).values)
np.shape(X)
np.shape(y)
X_train, X_test, y_train, y_test = train_test_split(np.array(X), np.array(y), test_size=0.3,random_state=109)
model = svm.SVC(kernel='linear')
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

How to use save SVM model for prediction

Referring to the post at How to use save model for prediction in python:
when I load the model and predict on new data, I get the following error.
Is there anything I can do to resolve it?
UnicodeEncodeError: 'decimal' codec can't encode character u'\u2019' in position 510: invalid decimal Unicode string
My entire code:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
X_train, X_test, y_train, y_test = train_test_split(df['IssueDetails'], df['CRST'], random_state = 0)
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
clf = LinearSVC().fit(X_train_tfidf, y_train)
cif_svm = Pipeline([('tfidf', tfidf_transformer), ('SVC', clf)])
from sklearn.externals import joblib
joblib.dump(cif_svm, 'modelsvm.pk1')
Fitmodel = joblib.load('modelsvm.pk1')
Fitmodel.predict(df_v)
I found the answer to my question above; I used the code below for prediction:
datad['CRSTS']=datad['Detail'].apply(lambda x: unicode(clf.predict(count_vect.transform([x]))))
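Note that the saved pipeline above starts at the TfidfTransformer, which expects count matrices rather than raw text, so new data has to be run through count_vect first (which is exactly what the .apply line does). A hedged alternative sketch that bundles the whole text path into a single pipeline, so the saved model accepts raw strings directly (joblib is imported top-level here because sklearn.externals.joblib has been removed from modern scikit-learn):

import joblib
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.svm import LinearSVC

# One pipeline from raw text to prediction: vectorize, reweight, classify.
text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('svc', LinearSVC()),
])
text_clf.fit(df['IssueDetails'], df['CRST'])
joblib.dump(text_clf, 'modelsvm.pkl')

loaded = joblib.load('modelsvm.pkl')
predictions = loaded.predict(df_v)  # df_v: an iterable of raw strings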

How to do PCA and SVM for classification in python

I am doing classification, and I have a list with two elements, like this:
Data = [list1, list2]
list1 is 1000x784: 1000 images, each reshaped from 28x28 into a 784-long vector.
list2 is 1000x1: the label each image belongs to.
With the below code, I applied PCA:
from matplotlib.mlab import PCA
results = PCA(Data[0])
the output is like this:
Out[40]: <matplotlib.mlab.PCA instance at 0x7f301d58c638>
Now I want to use SVM as the classifier. I should add the labels, so the new data for the SVM looks like this:
newData = [results, Data[1]]
I do not know how to use SVM here.
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in later releases
Data = [list1, list2]
X = Data[0]
y = Data[1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)
pca = PCA(n_components=2)  # adjust yourself
pca.fit(X_train)
X_t_train = pca.transform(X_train)
X_t_test = pca.transform(X_test)
clf = SVC()
clf.fit(X_t_train, y_train)
print('score', clf.score(X_t_test, y_test))
print('pred label', clf.predict(X_t_test))
Here is the same code tested on another dataset:
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
iris = datasets.load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)
pca = PCA(n_components=2)  # adjust yourself
pca.fit(X_train)
X_t_train = pca.transform(X_train)
X_t_test = pca.transform(X_test)
clf = SVC()
clf.fit(X_t_train, y_train)
print('score', clf.score(X_t_test, y_test))
print('pred label', clf.predict(X_t_test))
Based on these references:
http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html
http://scikit-learn.org/stable/modules/cross_validation.html
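As an aside on the "# adjust yourself" comment: instead of hard-coding n_components=2, scikit-learn's PCA also accepts a float between 0 and 1, keeping just enough components to explain that fraction of the variance:

from sklearn.decomposition import PCA

pca = PCA(n_components=0.95)  # keep enough components to explain 95% of the variance
pca.fit(X_train)
print(pca.n_components_, pca.explained_variance_ratio_.sum())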
I think what you are looking for is http://scikit-learn.org/. It's a Python library where you'll find PCA, SVM, and other machine-learning algorithms. It has a good tutorial, but I recommend you follow this one: http://www.astroml.org/sklearn_tutorial/general_concepts.html. For your particular question, the SVM page of scikit-learn should suffice: http://scikit-learn.org/stable/modules/svm.html.
