Most efficient way to pass one observation to sklearn classifier

So after a year of arduous work, my model is finally being deployed on my company's production servers.
In this production environment, my model is loaded in a Python script and a string is pulled from another server. I now have to parse this string and pass it to the model so it can make a prediction and return that output to the end user.
My current concern is efficiency. I am looking for a very fast way to convert the string to an array-like object that can be passed to my model.
Here's a replicable example:
# Load modules
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
# Load dummy data and target
X = load_breast_cancer()['data']
y = load_breast_cancer()['target']
# Initialize and fit classifier
clf = GradientBoostingClassifier(random_state=0)
clf.fit(X, y)
# [1] New string is received
string = '17.99|10.38|122.8|1001.0|0.1184|0.2776|0.3001|0.1471|0.2419|0.07871|1.095|0.9053|8.589|153.4|0.006399|0.04904|0.05373|0.01587|0.03003|0.006193|25.38|17.33|184.6|2019.0|0.1622|0.6656|0.7119|0.2654|0.4601|0.1189'
# [2] Convert string to array-like structure
import numpy as np
x = np.array(string.split('|')).astype(float)
# [3] Pass `x` to `clf` and predict probability
clf.predict_proba(x.reshape(-1, 30)).item(0)
> 0.9987537665581022
My question
Is there a more efficient way to parse a string and pass it to an sklearn model?
I think skipping the import numpy would speed things up. However, I'm open to any solution that can improve the runtime of steps [1], [2] and [3].

Check whether you really need double precision. If single precision ('f', i.e. float32) is enough, use
fromstring = np.fromstring
# ...
fromstring(string, dtype='f', sep='|')
It will be 3-4x faster than
np.array(string.split('|')).astype(float)
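As a minimal sketch of the parsing options discussed here, the snippet below converts a shortened, made-up input string in three ways and reshapes the result into the single-row 2D shape that predict_proba expects. The variable names are illustrative, and the list-comprehension variant is an assumption added for comparison, not something the answer above proposes. Actual speed depends on your NumPy version and hardware, so it is worth timing each variant yourself (for example with timeit).

```python
import numpy as np

# Shortened, hypothetical input; the real string has 30 fields
string = '17.99|10.38|122.8|1001.0'

# Original approach: split, build a string array, then cast to float64
x_split = np.array(string.split('|')).astype(float)

# np.fromstring in text mode (sep must be passed as a keyword in recent NumPy)
x_from = np.fromstring(string, dtype=np.float64, sep='|')

# Pure-Python parsing; scikit-learn also accepts a nested list as input
x_list = [float(v) for v in string.split('|')]

# Estimators expect a 2D input: one row per observation
x_row = x_from.reshape(1, -1)
```

With the fitted classifier from the question, the call would then be clf.predict_proba(x_row) or clf.predict_proba([x_list]).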

Related

Error due to different number of features in test and train sets after TF-IDF transform

I am trying to create an AI that reads my dataset and states whether an input outside the data is 1 or 0.
My dataset has a column for qualitative data and a column for a boolean. Here is a sample from it:
Imports:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
import re
import string
Opening and cleaning the dataset:
saisei_data = saisei_data.dropna(how='any',axis=0)
saisei_data = saisei_data.sample(frac=1)
X = saisei_data['Data']
y = saisei_data['Conscious']
saisei_data
Vectorisation:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorization = TfidfVectorizer()
xv_train = vectorization.fit_transform(X_train)
xv_test = vectorization.fit_transform(X_test)
Example Algorithm - Logistic Regression:
LR = LogisticRegression()
LR.fit(xv_train,y_train)
pred_lr=LR.predict(xv_test) # Here is where I get an error
Everything works fine until I predict using the logistic regression algorithm.
The Error:
ValueError: X has 112 features per sample; expecting 23
This seems to change to similar errors such as:
ValueError: X has 92 features per sample; expecting 45
I am new to machine learning so I don't really know what I'm doing when it comes to using the algorithms, however I tried printing the xv_test variable and here is a sample of the output (also changes often):
Any ideas?

That is because you erroneously apply .fit_transform() to your test data; and, in this case, you are lucky enough that the process produces a programming error, thus alerting you that you are doing something methodologically wrong (which is not always the case).
We never apply either .fit() or .fit_transform() to unseen (test) data. The fitting should be done only once with the training data, like you have done here:
xv_train = vectorization.fit_transform(X_train)
For subsequent transformations of unseen (test) data, we use only .transform(). So, your next line should be
xv_test = vectorization.transform(X_test)
That way, the features in the test set will be the same as the ones in the training set, as they should be in the first place.
Notice the difference between the two methods in the docs (emphasis mine):
fit_transform:
Learn vocabulary and idf, return document-term matrix.
transform:
Transform documents to document-term matrix.
Uses the vocabulary and document frequencies (df) learned by fit (or fit_transform).
and recall that we don't ever use the test set to learn anything.
So, simple general mnemonic rule, applicable practically everywhere:
The terms "fit" and "test data" are always (always...) incompatible; mixing them will create havoc.
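The rule can be illustrated with a small, self-contained sketch. The toy sentences and labels below are made up and only stand in for the asker's dataset; the point is that the vectorizer is fitted once on the training text and then only applied to the test text, so both matrices have the same number of columns.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data standing in for the asker's dataset
X_train = ["the cat sat", "the dog barked", "a cat purred", "a dog growled"]
y_train = [1, 0, 1, 0]
X_test = ["the cat barked loudly"]

vectorization = TfidfVectorizer()
xv_train = vectorization.fit_transform(X_train)  # learn vocabulary and idf here only
xv_test = vectorization.transform(X_test)        # reuse the learned vocabulary

LR = LogisticRegression()
LR.fit(xv_train, y_train)
pred_lr = LR.predict(xv_test)  # no feature-count mismatch
```

Words in the test text that never appeared in training (here "loudly") are simply ignored by transform.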

How to convert a Python dictionary to a Numpy array?

So the logistic regression from the sklearn library from Python has the .fit() function which takes x_train(features) and y_train(labels) as arguments to train the classifier.
It seems that x_train.shape = (number_of_samples, number_of_features)
For x_train I should use the extracted xvector.scp file, which I am reading like so:
b = kaldiio.load_scp('xvector.scp')
And I can print the content like so:
for file_id in b:
    xvector = b[file_id]
    print(xvector)
Right now the b variable is like a dictionary and you can get the x-vector value of the corresponding id. I want to use sklearn Logistic Regression to classify the x-vectors and in order to use the .fit() method I should pass an array as an argument.
My question is how can I make an array that contains only the xvector variables?
PS: there are about 1 million file_ids and each xvector has a length of 512, which seems too big for a single array

It seems you are trying to store the dictionary into a numpy array. If the dictionary is small, you can directly store the values as:
import numpy as np
x = np.array(list(b.values()))
However, this will run into OOM issues if the dictionary is large. In this case, you would need to use np.memmap as explained here: https://ipython-books.github.io/48-processing-large-numpy-arrays-with-memory-mapping/
Essentially, you have to add rows to the array one at a time, and flush it when you have run out of memory. The array is stored directly on the disk, so it avoids OOM issues.
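A minimal sketch of the memory-mapped approach follows. The dictionary b here is a small hypothetical stand-in for the object returned by kaldiio.load_scp, and the file name and float32 dtype are assumptions. For scale, 1,000,000 rows of 512 float32 values come to roughly 2 GB, so whether a plain in-memory array is feasible depends on the machine.

```python
import os
import tempfile
import numpy as np

# Hypothetical stand-in for the dict-like object from kaldiio.load_scp
b = {f"utt{i}": np.random.rand(512).astype(np.float32) for i in range(1000)}

n_samples, n_features = len(b), 512
path = os.path.join(tempfile.mkdtemp(), "xvectors.dat")  # assumed file name

# Array backed by a file on disk rather than held fully in RAM
x_train = np.memmap(path, dtype=np.float32, mode="w+",
                    shape=(n_samples, n_features))

# Copy the vectors in one row at a time
for row, file_id in enumerate(b):
    x_train[row] = b[file_id]
x_train.flush()  # make sure everything is written to disk

# Later, reopen the same file read-only without loading it all at once
x_loaded = np.memmap(path, dtype=np.float32, mode="r",
                     shape=(n_samples, n_features))
```

The resulting array can be sliced like a normal NumPy array, which also makes it easy to feed the data to an estimator in batches.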

Converting a pandas Interval into a string (and back again)

I'm relatively new to Python and am trying to get some data prepped to train a RandomForest. For various reasons, we want the data to be discrete, so there are a few continuous variables that need to be discretized. I found qcut in pandas, which seems to do what I want - I can set a number of bins, and it will discretize the variable into that many bins, trying to keep the counts in each bin even.
However, the output of pandas.qcut is a list of Intervals, and the RandomForest classifier in scikit-learn needs a string. I found that I can convert an interval into a string by using .astype(str). Here's a quick example of what I'm doing:
import pandas as pd
from random import sample
vals = sample(range(0,100), 100)
cuts = pd.qcut(vals, q=5)
str_cuts = pd.qcut(vals, q=5).astype(str)
and then str_cuts is one of the variables passed into a random forest.
However, the intent of this system is to train a RandomForest, save it to a file, and then allow someone to load it at a later date and get a classification for a new test instance, that is not available at training time. And because the classifier was trained on discretized data, the new test instance will need to be discretized before it can be used. So what I want to be able to do is read in a new instance, apply the already-established discretization scheme to it, convert it to a string, and run it through the random forest. However, I'm getting hung up on the best way to 'apply the discretization scheme'.
Is there an easy way to handle this? I assume there's no straight-forward way to convert a string back into an Interval. I can get the list of all Interval values from the discretization (ex: cuts.unique()) and apply that at test-time, but that would require saving/loading a discretization dictionary alongside the random forest, which seems clunky, and I worry about running into issues trying to recreate a categorical variable (coming mostly from R, which is extremely particular about the format of categorical variables). Or is there another way around this that I'm not seeing?

Use the labels argument in qcut and use pandas Categorical.
Either of those can help you create categories instead of intervals for your variable. Then, you can use a form of encoding, for example Label Encoding or Ordinal Encoding, to convert the categories (the factors, if you're used to R) to numerical values which the Forest will be able to use.
Then the process goes:
cutting => categoricals => encoding
and you don't need to do it by hand anymore.
Lastly, some gradient boosted trees libraries have support for categorical variables though it's not a silver bullet and will depend on your goal. See catboost and lightgbm.
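For the part of the question about re-applying the discretization to a new instance, one possible sketch (an assumption on my side, not something the answer above spells out) is to ask qcut for integer codes and the bin edges via labels=False and retbins=True, keep the edges, and later pass them to pd.cut. Only the array of edges then needs to be saved alongside the model.

```python
import numpy as np
import pandas as pd
from random import sample

vals = sample(range(0, 100), 100)

# Integer bin codes plus the bin edges learned from the training data
train_codes, bins = pd.qcut(vals, q=5, labels=False, retbins=True)

# Re-applying the saved edges reproduces the training codes
same_codes = pd.cut(vals, bins=bins, labels=False, include_lowest=True)

# A new instance is discretized with the same edges;
# values outside the training range would come back as NaN
new_codes = pd.cut([25], bins=bins, labels=False, include_lowest=True)
```

The integer codes can be passed to the random forest directly, so no round trip through strings is needed.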

For future searchers, there are benefits to using transformers from scikit-learn instead of pandas. In this case, KBinsDiscretizer is the scikit equivalent of qcut.
It can be used in a pipeline, which will handle applying the previously-learned discretization to unseen data without the need for storing the discretization dictionary separately or round trip string conversion. Here's an example:
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer
pipeline = make_pipeline(KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='quantile'),
                         RandomForestClassifier())
X, y = make_classification()
X_train, X_test, y_train, y_test = train_test_split(X, y)
pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)
If you really need to convert back and forth between pandas IntervalIndex and string, you'll probably need to do some parsing as described in this answer: https://stackoverflow.com/a/65296110/3945991 and either use FunctionTransformer or write your own Transformer for pipeline integration.

While it may not be the cleanest-looking method, converting a string back into an interval is indeed possible:
import pandas as pd
str_intervals = [i.replace("(","").replace("]", "").split(", ") for i in str_cuts]
original_cuts = [pd.Interval(float(i), float(j)) for i, j in str_intervals]

Bug with CalibratedClassifierCV when using a Pipeline with TF-IDF?

First of all thanks in advance, I don't really know if I should open an issue so I wanted to check if someone had faced this before.
So I'm having the following problem when using a CalibratedClassifierCV for text classification. I have an estimator which is a pipeline created this way (simple example):
# Import libraries first
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import LogisticRegression
# Now create the estimators: pipeline -> calibratedclassifier(pipeline)
pipeline = make_pipeline( TfidfVectorizer(), LogisticRegression() )
calibrated_pipeline = CalibratedClassifierCV( pipeline, cv=2 )
Now we can create a simple train set to check if the classifier works:
# Create text and labels arrays
text_array = np.array(['Why', 'is', 'this', 'happening'])
outputs = np.array([0,1,0,1])
When I try to fit the calibrated_pipeline object, I get this error:
ValueError: Found input variables with inconsistent numbers of samples: [1, 4]
If you want I can copy the whole exception trace, but this should be easily reproducible. Thanks a lot in advance!
EDIT: I made a mistake when creating the arrays. Fixed now (thanks @ogrisel!). Also, calling:
pipeline.fit(text_array, outputs)
works properly, but doing so with the calibrated classifier fails!

np.array(['Why', 'is', 'this', 'happening']).reshape(-1,1) is a 2D array of strings while the docstring of the fit_transform method of the TfidfVectorizer class states that it expects:
Parameters
----------
raw_documents : iterable
an iterable which yields either str, unicode or file objects
If you iterate over your 2D numpy array you get a sequence of 1D arrays of strings instead of strings directly:
>>> list(text_array)
[array(['Why'],
dtype='<U9'), array(['is'],
dtype='<U9'), array(['this'],
dtype='<U9'), array(['happening'],
dtype='<U9')]
So the fix is easy, just pass:
text_documents = ['Why', 'is', 'this', 'happening']
as the raw input to the vectorizer.
Edit: remark: LogisticRegression is almost always a well calibrated classifier by default. It will likely be the case that CalibratedClassifierCV won't bring anything in this case.
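The difference between the two input shapes can be checked with a short sketch. The variable names text_2d and text_documents are illustrative; the snippet only shows what the vectorizer receives when it iterates over each kind of input.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# 2D array of strings, as in the original version of the question
text_2d = np.array(['Why', 'is', 'this', 'happening']).reshape(-1, 1)

# Iterating a 2D array yields 1D arrays, not plain strings
first_item = list(text_2d)[0]

# Flattening back to a plain list of strings gives the expected input
text_documents = text_2d.ravel().tolist()
tfidf = TfidfVectorizer().fit_transform(text_documents)
```

Here first_item is a NumPy array holding one string, while each element of text_documents is an ordinary str that the vectorizer can tokenize.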

CountVectorizer matrix varies with new test data for classification?

I have created a model for text classification using python. I have CountVectorizer and it results in a document term matrix of 2034 rows and 4063 columns ( unique words ). I saved the model I used for new test data. My new test data
test_data = ['Love', 'python', 'every','time']
But the problem is I converted the above test data tokens into a feature vector, but it differs in shape. Because the model expect a 4063 vector. I know how to solve it by taking vocabulary of CountVectorizer and search for each token in test data and putting it in that index. But is there any easy way to handle this problem in scikit-learn itself.

You should not fit a new CountVectorizer on the test data; you should use the one you fit on the training data and call transform(test_data) on it.

You have two ways to solve this:
1. You can use the same CountVectorizer that you used for your training features, like this:
cv = CountVectorizer()  # pass your desired parameters here
X_train = cv.fit_transform(train_data)
X_test = cv.transform(test_data)
2. You can also create another CountVectorizer if you really want to (not advisable, since you would be wasting space and would still need to use the same parameters), and give it the same vocabulary:
cv_train = CountVectorizer()  # pass your desired parameters here
X_train = cv_train.fit_transform(train_data)
# Reuse the vocabulary learned on the training data
cv_test = CountVectorizer(vocabulary=cv_train.vocabulary_)  # plus the same parameters
X_test = cv_test.transform(test_data)

try to use:
test_features = inverse_transform(test_data)
this should return you what you wish for.

I added .toarray() to the whole command in order to see the results as a matrix.
So you should write:
X_test_analyst = Pipeline.named_steps['count_vectorizer'].transform(X_test).toarray()

I'm very late to this discussion, but I want to leave something for people arriving from a search engine.
Sorry for my bad English.
;)
As mentioned by @Andreas Mueller, you shouldn't create a new CountVectorizer with your new data(set). You can think of what a count vectorizer does as building a 2D array (or an Excel table): every column is a unique word, every row represents a document (or sentence), and the value at (i, j) is the frequency of the j-th word in the i-th sentence.
If you make a new CountVectorizer using your new data, the set of unique words will probably (if not certainly) be different. When you call model.predict on this data, it will raise an error telling you the dimensions are not correct.
What I did in my code is the following:
If you train your model in a different .py / .ipynb file, you can import pickle and use its dump function to save your fitted count vectorizer. You can follow the details in this post.
If you train your model in the same .py / .ipynb file, you can directly follow what @Andreas Mueller said.
code:
import pickle as pk
pk.dump(vectorizer,open(r'/relative path','wb'))
pk.dump(pca,open(r'/relative path','wb'))
# ...
# When you want to use:
import pickle as pk
vectoriser = pk.load(open(r'/relative path','rb'))
pca = pk.load(open(r'/relative path','rb'))
#...
Side note:
If I remember correctly, you can also export classes or other objects using pickle, but when you do so, make sure the class is already defined when you load the object. I'm not sure if this matters in this case, but I still import PCA and CountVectorizer before calling pk.load.
I'm just a beginner in coding, so please test my code before using it in your project.
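A self-contained sketch of the save-and-reload idea is given below. The training sentences and the temporary file path are made up for illustration; the test tokens are the ones from the question. The reloaded vectorizer is only used with transform, so the new data gets the same columns as the training matrix.

```python
import os
import pickle
import tempfile
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical training documents; test tokens taken from the question
train_data = ["I love python", "python every time", "time flies"]
test_data = ['Love', 'python', 'every', 'time']

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_data)

# Save the fitted vectorizer (a temporary path is used here as a placeholder)
path = os.path.join(tempfile.mkdtemp(), "vectorizer.pkl")
with open(path, "wb") as f:
    pickle.dump(vectorizer, f)

# Later, possibly in another script: load it and only call transform
with open(path, "rb") as f:
    loaded = pickle.load(f)

X_test = loaded.transform(test_data)
```

Using with blocks closes the files automatically, which the open(...) calls passed inline to dump and load do not guarantee.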
