creating pipeline for dictvectorizer and linearsvc in sklearn - python

I trained a LinearSVC classifier on an NER dataset (https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus) and would like it to predict on new data. From what I have read, I need to create and save the model as a pipeline to do this. I have been trying to do so based on other examples on SO but can't get it to work. How can I turn my existing model into a pipelined version?
The first code snippet trains and saves the model; the second is one of my attempts at turning it into a pipeline, but it fails with a "'str' object has no attribute 'items'" error. I think it has to do with the to_dict step, but I don't know how to replicate that in a pipelined version. Can anyone help?
dframe = pd.read_csv("ner.csv", encoding = "ISO-8859-1", error_bad_lines=False)
dframe.dropna(inplace=True)
dframe[dframe.isnull().any(axis=1)].size
x_df = dframe.drop(['Unnamed: 0', 'sentence_idx', 'tag'], axis=1)
vectorizer = DictVectorizer()
X = vectorizer.fit_transform(x_df.to_dict("records"))
y = dframe.tag.values
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0)
model = LinearSVC(loss="squared_hinge",C=0.5,class_weight='balanced',multi_class='ovr')
model.fit(x_train, y_train)
dump(model, 'filename.joblib')
dframe = pd.read_csv("ner.csv", encoding = "ISO-8859-1", error_bad_lines=False)
dframe.dropna(inplace=True)
dframe[dframe.isnull().any(axis=1)].size
x_df = dframe.drop(['Unnamed: 0', 'sentence_idx', 'tag'], axis=1)
y = dframe.tag.values
x_train, x_test, y_train, y_test = train_test_split(x_df, y, test_size=0.1, random_state=0)
pipe = Pipeline([('vectorizer', DictVectorizer(x_df.to_dict("records"))), ('model', LinearSVC)])
pipe.fit(x_train, y_train)

You have to adjust your second part like this:
dframe = pd.read_csv("ner.csv", encoding = "ISO-8859-1", error_bad_lines=False)
dframe.dropna(inplace=True)
dframe[dframe.isnull().any(axis=1)].size
x_df = dframe.drop(['Unnamed: 0', 'sentence_idx', 'tag'], axis=1)
y = dframe.tag.values
x_train, x_test, y_train, y_test = train_test_split(x_df.to_dict("records"), y, test_size=0.1, random_state=0)
pipe = Pipeline([('vectorizer', DictVectorizer()), ('model', LinearSVC(loss="squared_hinge",C=0.5,class_weight='balanced',multi_class='ovr'))])
pipe.fit(x_train, y_train)
Your first mistake was passing your data to the DictVectorizer() constructor:
DictVectorizer(x_df.to_dict("records"))
That does not work. The constructor only accepts the configuration parameters listed in the DictVectorizer documentation; the data is supplied later, when the pipeline is fitted.
The second mistake was fitting the pipeline on data split directly from x_df:
pipe.fit(x_train, y_train)
The pipeline hands x_train to the DictVectorizer(), but there x_train is just a slice of the DataFrame x_df, whereas in your code without the pipeline you gave the DictVectorizer() the data in the form x_df.to_dict("records"). Iterating over a DataFrame yields its column names, which are strings, hence the "'str' object has no attribute 'items'" error.
So you need to pass the same type of data to the pipeline. That's why the adjusted code applies train_test_split() to x_df.to_dict("records"), so that the vectorizer receives a list of dicts it can process.
Finally, you forgot the parentheses after LinearSVC when defining your pipeline, so the step received the class itself instead of an instance:
('model', LinearSVC)
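As a minimal sketch of the full round trip (fit, save, reload, predict), the whole pipeline can be saved as one object so that new, unvectorized dicts go straight into predict(). The toy records, the tags and the file name ner_pipeline.joblib are assumptions for illustration, not the Kaggle data:

```python
import os
import tempfile

from joblib import dump, load
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy stand-in for x_df.to_dict("records"): one dict of features per token.
records = [
    {"word": "London", "pos": "NNP"},
    {"word": "the", "pos": "DT"},
    {"word": "Paris", "pos": "NNP"},
    {"word": "of", "pos": "IN"},
    {"word": "Berlin", "pos": "NNP"},
    {"word": "a", "pos": "DT"},
]
tags = ["B-geo", "O", "B-geo", "O", "B-geo", "O"]

pipe = Pipeline([
    ("vectorizer", DictVectorizer()),
    ("model", LinearSVC(C=0.5, class_weight="balanced")),
])
pipe.fit(records, tags)

# Save the whole pipeline (vectorizer + model) and load it back.
# The file name is hypothetical; a temporary directory keeps the sketch tidy.
path = os.path.join(tempfile.mkdtemp(), "ner_pipeline.joblib")
dump(pipe, path)
loaded = load(path)

# New data only needs to be a list of dicts using the same feature names.
new_records = [{"word": "Madrid", "pos": "NNP"}, {"word": "the", "pos": "DT"}]
predictions = loaded.predict(new_records)
print(predictions)
```

Feature values the vectorizer never saw during fitting (here the word "Madrid") are silently ignored at prediction time, so the prediction then rests on the remaining features.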

Related

my standardscaler outputs ValueError: could not convert string to float

I am a beginner in Python and machine learning and am doing a project with the Titanic dataset. After splitting my dataset into training and testing sets, I wanted to normalize x_train using StandardScaler, but I keep getting this error:
ValueError: could not convert string to float: 'PassengerId'
This is my code:
feature =df[['PassengerId', 'PClass', 'Age', 'SibSp', 'Parch']].values
target = df[['Survived']].values
x_train, x_test, y_train, y_test = train_test_split(feature, target, test_size=0.2)
from sklearn.preprocessing import StandardScaler, OneHotEncoder
scaler = StandardScaler().fit(x_train)
x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)
How can I solve this?
# It's a good idea to exclude PassengerId, since it's probably not a predictive feature.
# Also convert all the values to float.
feature = df[['PClass', 'Age', 'SibSp', 'Parch']].astype('float').values
target = df[['Survived']].values
x_train, x_test, y_train, y_test = train_test_split(feature, target, test_size=0.2)
from sklearn.preprocessing import StandardScaler, OneHotEncoder
scaler = StandardScaler().fit(x_train)
x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)
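The error message means that a string reached the scaler, so it can help to check which columns hold text before scaling. Here is a minimal sketch, assuming a toy DataFrame (not the real Titanic file) in which one column was read as strings and contains an entry that cannot be parsed:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frame: 'Age' was read as text (object dtype).
df = pd.DataFrame({
    "Pclass": [3, 1, 3, 2],
    "Age": ["22", "38", "unknown", "35"],
    "SibSp": [1, 1, 0, 0],
})

# An 'object' dtype marks a column that holds strings.
print(df.dtypes)

# Convert every column to numbers; entries that cannot be parsed become NaN.
numeric = df.apply(pd.to_numeric, errors="coerce")

# Drop the rows that could not be converted before scaling.
numeric = numeric.dropna()

scaled = StandardScaler().fit_transform(numeric.values)
print(scaled)
```

Whether to drop or impute the unparseable rows depends on the data; dropping is only the simplest choice for the sketch.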

How to find out StandardScaling parameters .mean_ and .scale_ when using Column Transformer from Scikit-learn?

I want to apply StandardScaler only to the numerical columns of my dataset using sklearn.compose.ColumnTransformer (the rest is already one-hot encoded). I would like to see the .scale_ and .mean_ parameters fitted to the training data, but accessing scaler.mean_ and scaler.scale_ does not work when the scaler is used inside a column transformer. Is there a way to do so?
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
scaler = StandardScaler()
data_pipeline = ColumnTransformer([
('numerical', scaler, numerical_variables)], remainder='passthrough')
X_train = data_pipeline.fit_transform(X_train)
The fitted transformers are available in the attributes transformers_ (a list) and named_transformers_ (a dict-like with keys the names you provided). So, for example,
data_pipeline.named_transformers_['numerical'].mean_
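A small self-contained sketch of that lookup, assuming a toy DataFrame with two numerical columns and one already-encoded column (the column names are made up for illustration). ColumnTransformer fits a clone of the transformer it is given, which is why the parameters have to be read from the fitted copy rather than from the original scaler object:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Toy training data; 'is_member' stands in for an already one-hot encoded column.
X_train = pd.DataFrame({
    "age": [20.0, 30.0, 40.0, 50.0],
    "income": [1000.0, 2000.0, 3000.0, 6000.0],
    "is_member": [0, 1, 0, 1],
})
numerical_variables = ["age", "income"]

data_pipeline = ColumnTransformer(
    [("numerical", StandardScaler(), numerical_variables)],
    remainder="passthrough",
)
data_pipeline.fit(X_train)

# The fitted copy of the scaler, looked up by the name given above.
fitted_scaler = data_pipeline.named_transformers_["numerical"]
means = fitted_scaler.mean_
scales = fitted_scaler.scale_
print(means)
print(scales)
```

The values in scale_ are population standard deviations (ddof=0), so they differ slightly from the default pandas .std(), which uses ddof=1.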

I want to compare my prediction value with original train data

I am trying to learn how to use a decision tree regressor and have written the code below.
X_train, X_test, y_train, y_test = train_test_split(
x, y, test_size = 0.3, random_state = 100)
model = DecisionTreeRegressor(random_state=1)
model.fit(X_train,y_train)
y_pred = model.predict(X_test)
I want to create a dataframe that includes X_test, y_test and y_pred.
Is there a method or function for that?
Append the code below to the end of your prediction code:
final_df = X_test.copy()
final_df["Y_original"] = y_test
final_df["Y_predicted"] = y_pred
Here we create a new dataframe named final_df and put all the values you need into it. I would not recommend adding the columns directly to X_test, since you may need it unchanged for further predictions.
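An end-to-end sketch of the same idea, assuming x is a pandas DataFrame and y a pandas Series (the toy columns f1 and f2 and the target values are invented for illustration):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Toy data: two features and a target that is three times the first feature.
x = pd.DataFrame({"f1": range(10), "f2": [v * 2 for v in range(10)]})
y = pd.Series([v * 3.0 for v in range(10)], name="target")

X_train, X_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=100)
model = DecisionTreeRegressor(random_state=1)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Work on a copy so that X_test itself stays usable for later predictions.
final_df = X_test.copy()
final_df["Y_original"] = y_test    # a Series: aligned on the shared index
final_df["Y_predicted"] = y_pred   # a NumPy array: assigned by position
print(final_df)
```

If X_test were a NumPy array instead of a DataFrame, it would first need to be wrapped with pd.DataFrame(X_test) before columns can be added.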

naive bayes classifier dynamic training

Is it possible (and if so, how) to train an sklearn MultinomialNB classifier dynamically?
I would like to train (update) my spam classifier every time I feed an email into it.
I want this (does not work):
x_train, x_test, y_train, y_test = tts(features, labels, test_size=0.2)
clf = MultinomialNB()
for i in range(len(x_train)):
    clf.fit([x_train[i]], [y_train[i]])
preds = clf.predict(x_test)
to have similar result as this (works OK):
x_train, x_test, y_train, y_test = tts(features, labels, test_size=0.2)
clf = MultinomialNB()
clf.fit(x_train, y_train)
preds = clf.predict(x_test)
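Scikit-learn supports incremental learning for several estimators, including MultinomialNB; see the section on incremental learning in the scikit-learn documentation.
You'll need to use the method partial_fit() instead of fit(), because each call to fit() discards what was learned before. Your example code would look like this: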
x_train, x_test, y_train, y_test = tts(features, labels, test_size=0.2)
clf = MultinomialNB()
for i in range(len(x_train)):
    if i == 0:
        # The full set of classes must be given on the first call
        clf.partial_fit([x_train[i]], [y_train[i]], classes=numpy.unique(y_train))
    else:
        clf.partial_fit([x_train[i]], [y_train[i]])
preds = clf.predict(x_test)
Edit: added the classes argument to partial_fit, as suggested by @BobWazowski
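To check that feeding samples one at a time gives the same model as a single fit() call, here is a minimal sketch on made-up count data (the random features and the alternating labels are assumptions standing in for real email features):

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)
features = rng.integers(0, 5, size=(40, 6))  # toy word counts
labels = np.array([0, 1] * 20)               # toy ham / spam labels

# Reference model: trained on everything at once.
clf_full = MultinomialNB().fit(features, labels)

# Incremental model: updated with one sample per call.
clf_inc = MultinomialNB()
classes = np.unique(labels)
for i in range(len(features)):
    # classes is required on the first call; repeating it unchanged is allowed.
    clf_inc.partial_fit(features[i:i + 1], labels[i:i + 1], classes=classes)

same_counts = np.array_equal(clf_full.feature_count_, clf_inc.feature_count_)
same_preds = np.array_equal(clf_full.predict(features), clf_inc.predict(features))
print(same_counts, same_preds)
```

MultinomialNB only accumulates per-class feature counts, which is why the incremental and the batch model can end up with the same statistics; this does not hold for every estimator that offers partial_fit().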

How do I accept a non-csv input for my machine learning model?

Language: Python.
I have created a model and saved it with joblib. Now I want to load it to make predictions on new data, but the data arrives as a string: the values are numeric, but the features come as a single line separated by "," rather than as columns of one big dataframe. Can I do that? I know I can pass in a single input and get a single prediction, but I'm not sure how to do it.
I used
https://machinelearningmastery.com/save-load-machine-learning-models-python-scikit-learn/
as reference but I'm not too clear about the last bit (loading the model)
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
# Fitting K-NN to the Training set
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
classifier.fit(X_train, y_train)
# Predicting the Test set results
y_pred = classifier.predict(X_test)
# save the model to disk
filename = 'test_model.sav'
joblib.dump(classifier, filename)
loaded_model = joblib.load(filename)
result = loaded_model.score(X_test, y_test)
print(result)
*I did not post the data preprocessing part of the code
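If your problem is how to build the input vector X_test from a string, you can use np.fromstring:
input_string = '34,144,13'
# predict() expects a 2D array of shape (n_samples, n_features), so reshape the single sample
X_test = np.fromstring(input_string, dtype=int, sep=',').reshape(1, -1)
Because the model was trained on standardized features, apply the same fitted scaler (sc.transform) to X_test before predicting. To get the model's prediction for the above X_test, you can use:
loaded_model = joblib.load(filename)
prediction = loaded_model.predict(sc.transform(X_test))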
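One way to avoid handling the scaler separately is to save the scaler and the classifier together as a single pipeline. This is a minimal sketch on invented training data; it uses make_pipeline, which the question does not use, and stores the file in a temporary directory:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy training data with three features per sample.
X_train = np.array([[30, 140, 10], [35, 150, 12], [32, 145, 14],
                    [60, 200, 30], [65, 210, 28], [62, 205, 33]], dtype=float)
y_train = np.array([0, 0, 0, 1, 1, 1])

# Scaler and classifier travel together, so new inputs are scaled consistently.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
model.fit(X_train, y_train)

filename = os.path.join(tempfile.mkdtemp(), "test_model.sav")
joblib.dump(model, filename)
loaded_model = joblib.load(filename)

# Turn the comma-separated string into one row of shape (1, n_features).
input_string = "34,144,13"
sample = np.array(input_string.split(","), dtype=float).reshape(1, -1)
prediction = loaded_model.predict(sample)
print(prediction)
```

Several input strings can be handled the same way by stacking the parsed rows into one array of shape (n_samples, n_features) before calling predict().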
