I have a large, multi-dimensional, unlabelled dataset of cars (price, mileage, horsepower, ...) in which I want to find outliers. I decided to use the sklearn OneClassSVM to build a decision boundary, and I have two main issues with my approach:
My dataset contains a lot of missing values. Is there a way to make the SVM classify a record with missing features as an inlier if any possible value for the missing features would make it an inlier?
I now want to add a feedback loop of manually moderated outliers. The manually moderated data should improve the classification of the SVM. I've read about the LabelSpreading model for semi-supervised learning. Would it be feasible to feed the classification output of the OneClassSVM to the LabelSpreading model and retrain that model once a sufficient number of records have been manually validated?
For the first question: you could use sklearn.preprocessing.Imputer to impute the missing values with the mean or median:
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Imputer.html
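As a side note, in scikit-learn 0.20 and later the Imputer class was replaced by sklearn.impute.SimpleImputer. Here is a minimal sketch of median imputation, assuming X is your numeric car DataFrame (the column names and values below are only illustrative):
import pandas as pd
from sklearn.impute import SimpleImputer

# X stands in for your car data with missing entries (illustrative values)
X = pd.DataFrame({"price": [10000, None, 15000],
                  "mileage": [50000, 80000, None]})

imputer = SimpleImputer(strategy="median")           # or strategy="mean"
X_imputed = pd.DataFrame(imputer.fit_transform(X),   # fills each NaN with its column's median
                         columns=X.columns)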
You could also add some boolean features that record whether any of the other features were NaN. So if you have features X_1 and X_2, you add the boolean features
X_1_was_NaN and X_2_was_NaN
that are 1 wherever X_1 or X_2 is NaN. If X is your original pd.DataFrame, you can create them with
import pandas as pd

X = pd.DataFrame()
# Create your features here
# Get the locations of the NaNs as 0/1 indicator columns
X_2 = 1.0 * X.isnull()
# Rename the indicator columns
X_2.rename(columns=lambda x: str(x) + "_was_NaN", inplace=True)
# Paste them together
X = pd.concat([X, X_2], axis=1)
I am wondering what is the difference between pandas' get_dummies() encoding of categorical features as compared to the sklearn's OneHotEncoder().
I've seen answers that mention that get_dummies() cannot produce encodings for categories not seen in the training dataset (answers here). However, this is a consequence of running get_dummies() separately on the testing and training datasets (which can give inconsistent shapes). On the other hand, if we applied get_dummies() to the original dataset before splitting it, I think the two methods should give identical results. Am I wrong? Would that cause problems?
My code currently looks like the example below:
import pandas as pd

def one_hot_encode(ds, feature):
    # Get DF of dummy variables
    dummies = pd.get_dummies(ds[feature])
    # One dummy variable to drop (dummy variable trap)
    dummyDrop = dummies.columns[0]
    # Create a DF from the original and the dummies' DF
    # Drop the original categorical variable and the one dummy
    final = pd.concat([ds, dummies], axis='columns').drop([feature, dummyDrop], axis='columns')
    return final

# Get data DF
dataset = pd.read_csv("census_income_dataset.csv")
columns = dataset.columns
# Perform one-hot encoding (see function above) on the categorical features
features = ["workclass", "marital_status", "occupation", "relationship", "race", "sex", "native_country"]
for f in features:
    dataset = one_hot_encode(dataset, f)
# Re-order to get the output feature in the last column
dataset = dataset[[c for c in dataset.columns if c != "income_level"] + ["income_level"]]
dataset.head()
If you apply get_dummies() and OneHotEncoder() to the whole dataset, you should obtain the same result.
If you apply get_dummies() to the whole dataset and OneHotEncoder() only to the train dataset, you will probably see a few (very small) differences if the test data contains a "new" category. If not, they should give the same result.
The main difference between get_dummies() and OneHotEncoder() is how they behave when you use the model in real life (or in production) and receive a "new" class of a categorical column that you haven't faced before.
Example: imagine your category "sex" can only be male or female, and you sold your model to a company. What will happen if the category "sex" now receives the value "NA" (not applicable)? (You can also imagine that "NA" is a valid option, but it only appears 0.001% of the time, and by chance you don't have any of it in your dataset.)
Using get_dummies(), you will have a problem, since your model was trained on only 2 different categories of sex, and now it faces a different, new category that it can't handle.
Using OneHotEncoder() allows you to "ignore" this new category that your model can't face, keeping the same shape between the model input and your new sample input.
That's why people use OneHotEncoder() on the train set and not on the whole dataset: they are "simulating" this kind of situation (encountering a "new" class of a categorical column that you haven't faced before).
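A minimal sketch of that behaviour, assuming a lone categorical column "sex" with made-up data: an encoder fitted with handle_unknown='ignore' encodes an unseen category as an all-zero row, so the input keeps the shape the model was trained on.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"sex": ["male", "female", "female"]})
new = pd.DataFrame({"sex": ["NA"]})                 # a category never seen during training

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train[["sex"]])
print(enc.transform(new[["sex"]]).toarray())        # [[0. 0.]] -- same width as the training encoding, all zeros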
I am trying to calculate accuracy rate.
I have a pandas dataframe with numerous columns of data.
I have one column of predicted churns and one column of true churns for every customer.
Is there a way to calculate the accuracy metric and other metrics just between the two columns? Both columns are binary, with 0 meaning no churn and 1 meaning churn.
There are obviously many ways you can measure the accuracy of a prediction against known answers. Since you tagged this with machine learning and python, I suggest using a confusion matrix (aka error matrix) as a first pass. The scikit-learn python library has a module that you can use:
from sklearn.metrics import confusion_matrix
y_true = ...
y_pred = ...
confusion_matrix( y_true, y_pred )
source: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html
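If you specifically want the accuracy rate as a single number, sklearn.metrics.accuracy_score works directly on the two columns. A minimal sketch, assuming the columns are called churn_true and churn_pred (those names and values are made up, substitute your own):
import pandas as pd
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

df = pd.DataFrame({"churn_true": [0, 1, 0, 1, 1],
                   "churn_pred": [0, 1, 0, 0, 1]})

print(accuracy_score(df["churn_true"], df["churn_pred"]))         # fraction of correct predictions
print(confusion_matrix(df["churn_true"], df["churn_pred"]))       # rows: true class, columns: predicted class
print(classification_report(df["churn_true"], df["churn_pred"]))  # precision, recall, f1 per class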
I am currently doing an assignment for my data analysis course at uni. I managed to do the first two parts without many problems (EDA and text processing). I now need to do this:
Build a regression model that will predict the rating score of each product based on
attributes which correspond to some very common words used in the reviews
(selection of how many words is left to you as a decision). So, for each product you
will have a long(ish) vector of attributes based on how many times each word appears
in reviews of this product. Your target variable is the rating.
I find myself a bit lost on how to tackle this problem. Here is a link to the dataset I am using. Review2 is the lemmatized version of Review.
Any insight on how to solve this would be greatly appreciated!
P.S: I'm not posting here to get a full solution... Just a push in the right direction
EDIT:
This is the code I wrote for my ordinal regression (would it be possible to have some feedback?):
import pandas as pd
import matplotlib.pyplot as plt
import mord as m
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.decomposition import PCA

# Create word-count matrix
bow = df.Review2.str.split().apply(pd.Series.value_counts)
rating = df['Rating']
df_rating = pd.DataFrame([rating])
df_rating = df_rating.transpose()
bow = bow.join(df_rating)
# Remove rows without a rating and rare-word columns
bow = bow.loc[bow['Rating'].notna(), ~(bow.sum(0) < 80)]
# Divide into train - validation - test
bow.fillna(0, inplace=True)
rating = bow['Rating']
bow = bow.drop('Rating', axis=1)
x_train, x_test, y_train, y_test = train_test_split(bow, rating, test_size=0.4, random_state=0)
x_test, x_validate, y_test, y_validate = train_test_split(x_test, y_test, test_size=0.5, random_state=0)
# Run regression
regr = m.OrdinalRidge()
regr.fit(x_train, y_train)
y_pred = regr.predict(x_test)
scores = cross_val_score(regr, bow, rating, cv=5, scoring='accuracy')
# Plot validation ratings against a 1-D PCA projection of the validation features
pca = PCA(n_components=1)
x_validate_1d = pca.fit_transform(x_validate)
plt.scatter(x_validate_1d, y_validate, color='black')
plt.plot(x_validate_1d, regr.predict(x_validate), color='blue', linewidth=1)
plt.show()
This is what the plot looks like (took it from here):
Would it be possible to have some feedback on the code and, possibly, a better and more informative way to plot the results? (I don't really understand whether the regression is performing well or not.)
Build a regression model that will predict the rating score of each
product based on attributes which correspond to some very common words
used in the reviews (selection of how many words is left to you as a
decision). So, for each product you will have a long(ish) vector of
attributes based on how many times each word appears in reviews of
this product. Your target variable is the rating.
Let's pull this apart into several pieces!
So, for each product you will have a long(ish) vector of
attributes based on how many times each word appears in reviews of
this product.
This is a bag-of-words model, meaning you will have to create a matrix representation (still held in a pd.DataFrame) of your Review2 column, and there is a question asking how to do that here:
How to create a bag of words from a pandas dataframe
Below is a minimal example of how you can create that matrix with your Review2 column:
In [12]: import pandas as pd
In [13]: df = pd.DataFrame({"Review2":['banana apple mango', 'apple apple strawberry']})
In [14]: df
Out[14]:
Review2
0 banana apple mango
1 apple apple strawberry
In [15]: df.Review2.str.split()
Out[15]:
0 [banana, apple, mango]
1 [apple, apple, strawberry]
Name: Review2, dtype: object
In [16]: df = df.Review2.str.split().apply(pd.Series.value_counts) # this will produce the word count matrix
In [17]: df
Out[17]:
apple banana mango strawberry
0 1.0 1.0 1.0 NaN
1 2.0 NaN NaN 1.0
The bag-of-words model just counts how often a word occurs in a text of interest, with no regard for position, and represents a set of texts as a matrix in which each text is a row and the columns show the counts of all the words.
[...] based on attributes which correspond to some very common words
used in the reviews (selection of how many words is left to you as a
decision).
Now that you have your matrix representation (rows are the products, columns are the counts for each unique word), you can filter the matrix down to the most common words. I would encourage you to take a look at how the distribution of word counts looks. We will use seaborn for that and import it like so:
import seaborn as sns
Given that your pd.DataFrame holding the word-count matrix is called df, sns.distplot(df.sum()) should do the trick. Choose some cutoff that seems like it preserves a good chunk of the counts but doesn't include many words with low counts. It can be arbitrary and it doesn't really matter for now.
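A minimal sketch of that filtering step, assuming df is the word-count matrix built above and the cutoff of 80 is an arbitrary placeholder you would pick after looking at the plot:
import seaborn as sns
import matplotlib.pyplot as plt

sns.distplot(df.sum())                               # distribution of total counts per word
plt.show()

cutoff = 80                                          # placeholder threshold, choose one from the plot
df_common = df.loc[:, df.sum() >= cutoff].fillna(0)  # keep frequent words, treat absent words as 0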
Your word count matrix is your input data, or also called the predictor variable. In machine learning this is often called the input matrix or vector X.
Your target variable is the rating.
The output variable or target variable is the rating column. In machine learning this is often called the output vector y (note that this can sometimes also be an output matrix but most commonly one outputs a vector).
This means our model tries to adjust its parameters to map the word count data from each row to the corresponding rating value.
Scikit-learn offers a lot of machine learning models such as logistic regression which take an X and y for training and have a very unified interface. Jake Vanderplas's Python Data Science Handbook explains the Scikit-learn interface nicely and shows you how to use it for a regression problem.
EDIT: We are using logistic regression here but as correctly pointed out by fuglede, logistic regression ignores that ratings are on a set scale. For this you can use mord.OrdinalRidge, the API of which works very similarly to that of scikit-learn.
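A minimal sketch of how mord.OrdinalRidge could be used, assuming X is your (filtered) word-count matrix and y the integer ratings; the random data and the alpha value here are only placeholders:
import numpy as np
import mord

X = np.random.poisson(1.0, size=(100, 20))   # stand-in for the word-count matrix
y = np.random.randint(1, 6, size=100)        # stand-in for ratings from 1 to 5

model = mord.OrdinalRidge(alpha=1.0)         # ridge regression that respects the ordered rating scale
model.fit(X, y)
predictions = model.predict(X[:5])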
Before you train your model, you should split your data set into a training, a test and a validation set - a good ratio is probably 60:20:20.
This way you will be able to train your model on your training set and evaluate how well it is predicting your test data set, to help you tune your model parameters. You will know when your model is overfitting to your training data and when it is genuinely producing a good general model for this task. The problem with this approach is that you can still end up overfitting to your test data if you adjust the model parameters often enough.
This is why we have a validation set - it is there to make sure we are not accidentally overfitting our model's parameters to both our training and test sets without knowing it. Typically we only test on the validation set once, so as to avoid overfitting to it as well - it is only used in the final model evaluation step.
Scikit-learn has a function for that, too: train_test_split
train_test_split however only makes one split, so you would first split your data set 60:40 and then the 40 you would split 50:50 into test and validation set.
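A minimal sketch of those two consecutive splits, assuming X is the word-count matrix and y the ratings:
from sklearn.model_selection import train_test_split

# first take 60% for training, then cut the remaining 40% in half: 20% test, 20% validation
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_test, X_val, y_test, y_val = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)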
You can now train different models on your training data and test them using the predict function of your model on your test set. Once you think you have done a good job and your model is good enough, you now test it on your validation set.
I have a training data set that has categorical features on which I use pd.get_dummies to one hot encode. This produces a data set with n features. I then train a classification model on this data set with n features. If I now get some new data with the same categorical features and again perform one hot encoding, the resultant number of features is m < n.
I cannot predict the classes of the new data set if the dimensions don't match with the original training data.
Is there a way to include all of the original n features in the new data set after one hot encoding?
EDIT: I am using sklearn.ensemble.RandomForestClassifier as my classification library.
For example, say your training frame tradf has the dummy columns ['A_1', 'A_2']. If your new df only has the column ['A'] and it contains just category 1, you can do
pd.get_dummies(df).reindex(columns=tradf.columns,fill_value=0)
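A minimal self-contained sketch of this reindex trick (the column name A and the values are made up):
import pandas as pd

tradf = pd.get_dummies(pd.DataFrame({"A": [1, 2, 1]}), columns=["A"])  # training dummies: A_1, A_2
df = pd.DataFrame({"A": [1]})                                          # new data only contains category 1

new_dummies = pd.get_dummies(df, columns=["A"]).reindex(columns=tradf.columns, fill_value=0)
print(new_dummies)   # has both A_1 and A_2; the missing A_2 column is filled with 0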
When scaling the data, why do we use fit and transform on the train dataset, but only transform on the test dataset?
import numpy as np
from random import seed, randint

SAMPLE_COUNT = 5000
TEST_COUNT = 20000
seed(0)
sample = list()
test_sample = list()
for index, line in enumerate(open('covtype.data', 'rt')):
    if index < SAMPLE_COUNT:
        sample.append(line)
    else:
        r = randint(0, index)
        if r < SAMPLE_COUNT:
            sample[r] = line
        else:
            k = randint(0, index)
            if k < TEST_COUNT:
                if len(test_sample) < TEST_COUNT:
                    test_sample.append(line)
                else:
                    test_sample[k] = line

from sklearn.preprocessing import StandardScaler
for n, line in enumerate(sample):
    sample[n] = list(map(float, line.strip().split(',')))
y = np.array(sample)[:, -1]
scaling = StandardScaler()
X = scaling.fit_transform(np.array(sample)[:, :-1])  ## here use fit and transform
for n, line in enumerate(test_sample):
    test_sample[n] = list(map(float, line.strip().split(',')))
yt = np.array(test_sample)[:, -1]
Xt = scaling.transform(np.array(test_sample)[:, :-1])  ## why here only use transform
As the comments in the code ask: why does Xt only use transform and not fit?
We use fit_transform() on the train data so that we learn the scaling parameters on the train data and scale the train data at the same time.
We only use transform() on the test data because we use the scaling parameters learned on the train data to scale the test data.
This is the standard procedure for scaling. You always learn your scaling parameters on the train set and then use them on the test set. Here is an article that explains it very well: https://sebastianraschka.com/faq/docs/scale-training-test.html
We have two datasets: the training and the test dataset. Imagine we have just 2 features:
'x1' and 'x2'.
Now consider this (A very hypothetical example):
A sample in the training data has values: 'x1' = 100 and 'x2' = 200
When scaled, 'x1' gets a value of 0.1 and 'x2' a value of 0.1 too. The response variable value is 100 for this. These have been calculated w.r.t only the training data's mean and std.
A sample in the test data has the values 'x1' = 50 and 'x2' = 100. When scaled according to the test data's own statistics, 'x1' = 0.1 and 'x2' = 0.1. This means that our function will predict a response value of 100 for this sample too. But this is wrong. It shouldn't be 100; it should predict something else, because the unscaled feature values of the 2 samples mentioned above are different and thus point to different response values. We will only know what the correct prediction is when we scale according to the training data, because those are the values that our linear regression function has learned from.
I have tried to explain the intuition behind this logic below:
We decide to scale both the features in the training dataset before applying linear regression and fitting the linear regression function. When we scale the features of the training dataset, all 'x1' features get adjusted according to the mean and the standard deviations of the different samples w.r.t to their 'x1' feature values. Same thing happens for 'x2' feature.
This essentially means that every feature has been transformed into a new number based on just the training data. It's like Every feature has been given a relative position. Relative to the mean and std of just the training data. So every sample's new 'x1' and 'x2' values are dependent on the mean and the std of the training data only.
Now, what happens when we fit the linear regression function is that it learns its parameters (i.e., learns to predict the response values) based on the scaled features of our training dataset. That means it is learning to predict based on those particular means and standard deviations of 'x1' and 'x2' across the samples in the training dataset. So the value of the predictions depends on the:
* learned parameters, which in turn depend on the
* values of the features of the training data (which have been scaled), and, because of the scaling, the training data's features depend on the
* training data's mean and std.
If we now fit the StandardScaler() to the test data, the test data's 'x1' and 'x2' will have their own mean and std. This means that the new values of both features will be relative only to the data in the test set and will thus have no connection whatsoever to the training data. It's almost as if random values have been subtracted from them and they have been divided by random values, so the new numbers no longer convey how they relate to the training data.
Any transformation you apply to the data must be done with the parameters generated from the training data.
Simply put, the fit() method creates a model that extracts the various parameters from your training samples to do the necessary transformation later on. transform(), on the other hand, does the actual transformation of the data itself, returning a standardized or scaled form.
fit_transform() is just a faster way of doing fit() and then transform() on the same data.
The important thing here is that when you divide your dataset into train and test sets, what you are trying to achieve is to somewhat simulate a real-world application. In a real-world scenario you will only have training data; you will develop a model on it and predict unseen instances of similar data.
If you transform the entire dataset with fit_transform() and then split it into train and test, you violate that simulation and do the transformation according to the unseen examples as well. This will inevitably result in an overly optimistic model, as you have already partly prepared your model using the metrics of the unseen samples.
If you split the data into train and test and apply fit_transform() to both, you will also be mistaken, as your first transformation of the train data will be done with the train split's metrics only and the second will be done with the test split's metrics only.
The right way to do this preprocessing is to fit any transformer on the train data only and then transform the test data with it. Only then can you be sure that your resulting model represents a real-world solution.
Following this, it actually doesn't matter whether you
fit(train) then transform(train) then transform(test), OR
fit_transform(train) then transform(test) (see the sketch below).
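A minimal sketch that checks both orderings give identical scaled values (the toy arrays are only placeholders):
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[2.5]])

# variant 1: fit(train), then transform(train) and transform(test)
s1 = StandardScaler().fit(X_train)
a_train, a_test = s1.transform(X_train), s1.transform(X_test)

# variant 2: fit_transform(train), then transform(test)
s2 = StandardScaler()
b_train = s2.fit_transform(X_train)
b_test = s2.transform(X_test)

assert np.allclose(a_train, b_train) and np.allclose(a_test, b_test)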
fit() is used to compute the parameters needed for the transformation, and transform() scales the data to convert it into the standard format for the model.
fit_transform() is a combination of the two, doing the above work efficiently.
Since fit_transform() already computes the parameters and transforms the training data, only the transformation of the testing data is left; the parameters needed for that transformation are already computed and stored, therefore only transform() is used instead of fit_transform().
There could be two approaches:
1st approach: fit and transform the scaler on the train data, and only transform the test data.
2nd approach: fit and transform on the whole set: train + test.
Think about how the model will handle scaling when it goes live: when new data arrives, it will behave just like the unseen test data in your backtest.
In the 1st case, new data will simply be transformed with the existing scaler, and the scaled values from your model's backtest remain unchanged.
In the 2nd case, when new data comes in you will need to fit-transform the whole dataset again, which means the backtest scaled values will no longer be the same and you will need to re-train the model. If that can be done quickly then I guess it is OK, but the 1st case requires less work.
And if there are big differences between the scaling of the train and test sets, the data is probably non-stationary and ML is probably not a good idea.
fit() and transform() are also the two methods generally used to account for missing values in a dataset. The missing values can be filled by computing the mean or the median of the data and filling the empty places with that mean or median.
fit() is used to calculate the mean or the median.
transform() is used to fill in the missing values with the calculated mean or median.
fit_transform() performs the above 2 tasks in a single stretch.
fit_transform() is used on the training data to perform the above. When it comes to the validation set, only transform() is required, since you don't want to change the way you handle missing values for the validation set; by doing so you might take your model by surprise, and it may fail to perform as expected.
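A minimal sketch of this pattern for missing values, assuming sklearn.impute.SimpleImputer and toy arrays:
import numpy as np
from sklearn.impute import SimpleImputer

X_train = np.array([[1.0], [np.nan], [3.0]])   # mean of the observed training values is 2.0
X_valid = np.array([[np.nan], [5.0]])

imp = SimpleImputer(strategy="mean")
imp.fit(X_train)                         # the mean (2.0) is learned from the training data only
X_valid_filled = imp.transform(X_valid)  # the validation NaN is filled with 2.0, not the validation set's own mean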
We use fit() or fit_transform() in order to learn the scaling (to train the scaler) on the train data set. transform() can then be used with the trained scaler on the test data set.
fit_transform() - learn the parameters of scaling (train data)
transform() - apply those learned scaling parameters (test data)
from sklearn.preprocessing import StandardScaler

ss = StandardScaler()
X_train = ss.fit_transform(X_train)  # learn the scaling parameters from the training data and scale it
X_test = ss.transform(X_test)        # apply the learned parameters to scale the test data