I am trying to do some analysis on unigrams in scikit-learn. I created files in svmlight format and tried to run MultinomialNB(), KNeighborsClassifier() and SVC(). When I first tried this with unigrams, I got an X training dimension error, presumably because the only unigrams included in a given file are the ones that appear in its own examples, so the training and test matrices end up with different numbers of columns. I then tried creating svmlight training files that include placeholders for every unigram seen in the corpus, even those not present in a given example.
The problem is that this inflated the training files from 3 MB to 300 MB, which caused memory errors when sklearn loaded the files. Is there a way to get around the dimension mismatches or the memory overflows?
X_train, y_train = load_svmlight_file(trainFile)
x_test, y_test = load_svmlight_file(testFile)
try:
    clf = MultinomialNB()
    clf.fit(X_train, y_train)
    preds = clf.predict(x_test)
    print('Input data: ' + trainFile.split('.')[0])
    print('naive_bayes')
    print('accuracy: ' + str(accuracy_score(y_test, preds)))
    if 1 in preds:
        print('precision: ' + str(precision_score(y_test, preds)))
        print('recall: ' + str(recall_score(y_test, preds)))
except Exception as inst:
    print('fail in NB ' + 'Input data: ' + trainFile.split('.')[0])
    print(str(inst))
2828 training examples and 1212 test examples, with 18000 distinct unigrams.
EDIT: I tried to use the sklearn CountVectorizer, but I am still getting the memory issues. Is this the best way to do this?
def fileLoadForPipeline(trainSetFile, valSetFile):
    with open(trainSetFile) as json_file:
        tdata = json.load(json_file)
    with open(valSetFile) as json_file:
        vdata = json.load(json_file)
    x_train = []
    x_val = []
    y_train = []
    y_val = []
    for t in tdata:
        x_train.append(t['request_text'])
        y_train.append(t['requester_received_pizza'])
    for v in vdata:
        x_val.append(v['request_text'])
        y_val.append(v['requester_received_pizza'])
    return x_train, y_train, x_val, y_val

def buildPipeline(trainset, valset, norm):
    x_train, y_train, x_val, y_val = fileLoadForPipeline(trainset, valset)
    bigram_vectorizer = CountVectorizer(ngram_range=(1, 2), token_pattern=r'\b\w+\b', min_df=1)
    # fit the vocabulary on the training text only, then reuse it for validation
    xT = bigram_vectorizer.fit_transform(x_train).toarray()
    xV = bigram_vectorizer.transform(x_val).toarray()
    if norm:
        transformer = TfidfTransformer()
        xT = transformer.fit_transform(xT)
        xV = transformer.transform(xV)
    results = []
    for clf, name in ((Perceptron(max_iter=50), "Perceptron"),
                      (KNeighborsClassifier(n_neighbors=40), "kNN"),
                      (MultinomialNB(alpha=.01), 'MultinomialNB'),
                      (BernoulliNB(alpha=.1), 'BernoulliNB'),
                      (svm.SVC(class_weight='balanced'), 'svc')):
        print(80 * '=')
        print(name)
        results.append(benchmark(clf))
Try using scikit-learn's CountVectorizer, which does the feature extraction on raw text for you. Most importantly, calling fit_transform on a set of training examples performs the bag-of-words unigram transformation: it keeps track of all n unique words found in the training corpus and converts each document into a vector of length n whose features are either word counts or binary presence indicators (depending on the binary option). The great thing about CountVectorizer is that it returns the data as a SciPy sparse matrix, which is very memory efficient and should solve the memory problems you're having, as long as you don't convert the result to a dense array.
You can then call transform on future test examples, and it will convert them using the vocabulary learned from the training set.
This should also solve the dimensionality issues, because the vocabulary learned during fit fixes the number of columns of every matrix the vectorizer produces. Specific information on usage is here:
http://scikit-learn.org/stable/modules/feature_extraction.html#common-vectorizer-usage
An added benefit of this is that you can combine this vectorizer with a classifier using a Pipeline to make fitting and testing more convenient.
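To make this concrete, here is a minimal sketch of both ideas: keeping the matrices sparse and wrapping the vectorizer and a classifier in a Pipeline. The tiny corpus, the labels and the step names "counts" and "nb" are made up for illustration; they are not from the question.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny made-up corpus, for illustration only.
train_texts = [
    "please send pizza to a hungry student",
    "lost my job and could use a pizza",
    "just testing this subreddit",
    "random post about nothing",
]
train_labels = [1, 1, 0, 0]
test_texts = ["hungry student with unseen words wants pizza"]

# fit_transform learns the vocabulary; the result is a scipy.sparse matrix.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)
# transform reuses the learned vocabulary, so the column count matches
# and words never seen in training are simply ignored.
X_test = vectorizer.transform(test_texts)
print(issparse(X_train), X_train.shape, X_test.shape)

# The same two steps wrapped in a Pipeline together with a classifier.
pipeline = Pipeline([
    ("counts", CountVectorizer()),
    ("nb", MultinomialNB()),
])
pipeline.fit(train_texts, train_labels)
preds = pipeline.predict(test_texts)
print(preds)
```

MultinomialNB accepts sparse input directly, so there is no need to call toarray() anywhere in this flow.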
Related
This is for a project that's due soon, so help would be greatly appreciated. I've never done ML before, so sorry if the mistake is an absolute smooth brain one.
I have a dataset that's a bunch of tweets along with personality scores, and I need to train a model to predict the scores.
This is what I've done so far by following a bunch of tutorials and stitching together what I learned.
train = pandas.read_csv('../dataset/cleaner_dataset.csv')
train['tweet'] = train['tweet'].str.lower()
train['tweet'] = train['tweet'].replace('[^a-zA-Z0-9]', ' ', regex = True)
X = train['tweet']
y = train['neuroticism']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
vectorizer = TfidfVectorizer(min_df=5)
X_vectorized = vectorizer.fit_transform(X_train)
vectorizer = TfidfVectorizer(min_df=5)
X_test_vec = vectorizer.fit_transform(X_train)
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor()
model.fit(X_vectorized, y_train)
model.score(X_test_vec, y_test)
However I'm getting an error on the last line of code when I run it in the notebook.
ValueError: Found input variables with inconsistent numbers of samples: [495, 1980]
Full error message: https://imgur.com/a/GS7jEi5
You are using X_train for both the training and the test features, which is why you are getting the error.
Try:
vectorizer = TfidfVectorizer(min_df=5)
X_vectorized = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test) # use the same vectorizer, do not define a new one
As pointed out below, we don't fit the vectorizer on the test set; we only transform it.
But you still need to use X_test together with y_test.
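A small self-contained sketch of the same fix, so the shapes can be checked directly. The tweets and scores are invented stand-ins for the real dataset, and min_df is left at its default because the corpus is tiny.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Made-up tweets and scores, standing in for the real dataset.
tweets = [
    "feeling anxious about exams today",
    "great day at the beach with friends",
    "cannot sleep again worried about work",
    "loving this sunny weather",
    "so stressed about the deadline",
    "relaxing weekend with a good book",
    "nervous about the interview tomorrow",
    "had a calm and quiet evening",
    "worried sick about my results",
    "enjoying coffee on a slow morning",
]
scores = [4.1, 1.8, 4.4, 1.5, 4.0, 1.9, 3.8, 1.6, 4.5, 1.7]

X_train, X_test, y_train, y_test = train_test_split(
    tweets, scores, test_size=0.2, random_state=0)

vectorizer = TfidfVectorizer()
# Learn the vocabulary and IDF weights from the training text only.
X_train_vec = vectorizer.fit_transform(X_train)
# Reuse them for the test text: same columns, one row per test tweet.
X_test_vec = vectorizer.transform(X_test)

# Rows line up with the targets, columns line up with each other.
print(X_train_vec.shape, len(y_train))
print(X_test_vec.shape, len(y_test))

model = RandomForestRegressor(n_estimators=20, random_state=0)
model.fit(X_train_vec, y_train)
preds = model.predict(X_test_vec)
```

With the shapes lined up this way, model.score(X_test_vec, y_test) no longer complains about inconsistent numbers of samples.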
I'm new to this field and I'm currently working with gene expression data. I have to do a classification where my data are counts in matrix form. The features are the genes and the samples to classify are the patients (7 types of cancer and healthy donors). The book from which I'm replicating the experiment says the following:
For the multi-class SVM classification algorithm, a One-Versus-One (OVO) approach was used. To cross validate the algorithm for all samples in the training cohort, the SVM algorithm was trained by all samples in the training cohort minus one, while the remaining sample was used for (blind) classification. This process was repeated for all samples until each sample was predicted once (leave-one-out cross-validation [LOOCV] procedure).
Now, I know how to use LOOCV in Python and how to use OVO from looking online, but I don't get what is meant to be done here. I made an attempt and the results came out quite similar, but I'm pretty sure I'm making a horrible mistake somewhere. Please don't flame me, I need help. Below is my interpretation (I copied this from the internet and wrapped the SVM in OVO instead of using it alone):
# Function for training
def loocv(train_X, train_y):
    # define X and y
    X = train_X
    y = train_y
    # define LOOCV
    loo = LeaveOneOut()
    loo.get_n_splits(X)
    # define true and predict list
    y_true, y_pred = [], []
    # run
    for train_index, test_index in loo.split(X):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        model = SVC(kernel='linear', random_state=0)
        ovo_classifier = OneVsOneClassifier(model)
        ovo_classifier.fit(X_train, y_train)
        yhat = ovo_classifier.predict(X_test)
        y_true.append(y_test[0])
        y_pred.append(yhat[0])
    return y_true, y_pred, ovo_classifier
Validation :
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,random_state=0)
y_true,y_pred,model = loocv(X_train,y_train)
pred_y = model.predict(X_test)
training_accuracy = accuracy_score(y_true,y_pred)
accuracy = accuracy_score(y_test,pred_y)
print(accuracy)
print(training_accuracy)
Results :
0.6918604651162791
0.6658291457286432
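For comparison, one possible reading of the quoted procedure can be sketched with cross_val_predict, which trains on all samples but one and predicts the held-out sample, for every sample in turn. This is only a sketch under assumptions: the data comes from make_classification as a synthetic stand-in for the expression matrix, and refitting on the whole training cohort at the end is my own choice, not something the quoted text states.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the expression matrix: 60 samples, 20 features, 3 classes.
X, y = make_classification(n_samples=60, n_features=20, n_informative=8,
                           n_classes=3, random_state=0)

ovo = OneVsOneClassifier(SVC(kernel="linear", random_state=0))

# Each sample is predicted by a model trained on all the other samples.
y_pred = cross_val_predict(ovo, X, y, cv=LeaveOneOut())
loocv_accuracy = accuracy_score(y, y_pred)
print(loocv_accuracy)

# A model for later use is refit once on the whole training cohort,
# instead of keeping whichever model the last fold happened to produce.
final_model = ovo.fit(X, y)
```

The difference from the loop above is mainly in what gets returned: the LOOCV predictions are used only to estimate accuracy, and the classifier applied to a separate test set is trained on every training sample rather than on all but the last one.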
I have a dataset in CSV format with 6 columns and 1877 rows. The full dataset can be viewed on ShareCSV.
The first five columns are characteristics and the final column is a binary result. I want to create a classification network to predict the result from the five inputs, as seen in the CSV above.
I use the following code to normalize the data with pandas.
from sklearn import preprocessing
import pandas as pd
df = pd.read_csv(r"D:\path\data.csv", sep=",")
df=(df-df.min())/(df.max()-df.min())
I now need to pass this data to scikit-learn and select a classification algorithm, but this is where I am unsure what would be optimal. If anyone could recommend the best algorithm for my data and a rough implementation, that would be great.
I would say that you can split the data with train_test_split, train a classification algorithm on the training part, and evaluate it with an appropriate metric.
Below is something that I used (for a regression problem, though) that you can adapt to your own needs:
X = TRAIN_DS[["season", "holiday", "workingday", "weather", "weekday",
              "month", "year", "hour", 'humidity', 'temperature']]
y = TRAIN_DS['count']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
estimators = [('randf', RandomForestRegressor(max_depth=50, n_estimators=1500)),
              ('gradb', GradientBoostingRegressor(max_depth=5, n_estimators=400)),
              ('gradb2', GradientBoostingRegressor(n_estimators=4000)),
              ('svr', SVR(kernel='rbf', gamma='auto')),
              ('ext', ExtraTreesRegressor(n_estimators=4000))]
voting = StackingRegressor(estimators)
voting.fit(X=X_train, y=np.log1p(y_train))
To choose the best model, I would suggest comparing candidates with an appropriate metric. Here is a function computing RMSLE and R2 that you may find useful:
'''Calculating RMSLE score, r2 score as well as plotting'''
def calc_plot(y_test, y_pred, name):
    # Removing negative values
    for i, y in enumerate(y_pred):
        if y_pred[i] < 0:
            y_pred[i] = 0
    # Printing scoring
    print('RMSLE for ' + name + ':', np.sqrt(mean_squared_log_error(y_test, y_pred)))
    print('R2 for ' + name + ':', r2_score(y_test, y_pred))
Also, you can use a VotingClassifier or a StackingClassifier to combine multiple models for your predictions.
Finally, you can use GridSearchCV to try different values for the parameters of the classification algorithm that you use. An example for a regression problem would be the following:
gr = SGDRegressor()
parameters = {'loss':['squared_loss','huber','epsilon_insensitive','squared_epsilon_insensitive'], 'penalty':['l2','l1','elasticnet'],
'fit_intercept':[True,False], 'learning_rate':['constant','optimal','invscaling','adaptive'], 'alpha':[0.0001,0.005,0.001],
'l1_ratio':[0.15,0.5,0.25], 'max_iter':[500,1000,2000], 'epsilon':[0.1,0.4], 'eta0':[0.01,0.05,0.1], 'power_t':[0.25,0.1,0.5],
'early_stopping':[True,False], 'warm_start':[True,False],'average':[True,False], 'n_iter_no_change':[3,5,10,15]}
lModel = GridSearchCV(gr,parameters, cv=LeaveOneOut(), scoring = 'neg_mean_absolute_error')
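Since the snippets above are for regression, here is a rough sketch of the same workflow for a binary classification problem with five inputs. The data is generated with make_classification as a stand-in for the CSV, and the choice of RandomForestClassifier and the small parameter grid are only assumptions for illustration, not a recommendation of the best model for this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in: 5 numeric inputs and a binary result column.
X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           n_redundant=0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Small illustrative grid; 5-fold cross-validation on the training part only.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

# The best estimator is refit on the full training part and scored on held-out data.
test_accuracy = search.score(X_test, y_test)
print(search.best_params_, test_accuracy)
```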
Hope it helps!
This is my first attempt at document classification with ML and Python.
I first query my database to extract 5000 articles related to money laundering and convert them to a pandas df.
Then I extract 500 articles not related to money laundering and also convert them to a pandas df.
I concatenate both dfs and label the rows either 'money-laundering' or 'other'.
I do preprocessing (removing punctuation and stopwords, lower-casing, etc.)
and then feed the model based on the bag-of-words principle, as below:
vectorizer = CountVectorizer(analyzer = "word",
                             tokenizer = None,
                             preprocessor = None,
                             stop_words = None,
                             max_features = 5000)
text_features = vectorizer.fit_transform(full_df["processed full text"])
text_features = text_features.toarray()
labels = np.array(full_df['category'])
X_train, X_test, y_train, y_test = train_test_split(text_features, labels, test_size=0.33)
forest = RandomForestClassifier(n_estimators = 100)
forest = forest.fit(X_train, y_train)
y_pred = forest.predict(X_test)
accuracy_score(y_pred=y_pred, y_true=y_test)
It works fine so far (even though it gives me a suspiciously high accuracy of 99%). But I would now like to test it on a completely new text document. If I vectorize it and call forest.predict(test), it obviously says:
ValueError: Number of features of the model must match the input. Model n_features is 5000 and input n_features is 45
I am not sure how to overcome this in order to classify a totally new article.
First of all, even though my proposal may work, I want to stress that this solution has statistical and computational consequences that you need to understand before running this code.
Let's assume you have an initial corpus of texts full_df["processed full text"] and that test is the new text you would like to classify.
Then let's define full_added as the corpus containing the texts of full_df plus test.
text_features = vectorizer.fit_transform(full_added)
text_features = text_features.toarray()
You can then use the rows of text_features that correspond to full_df as your training set X_train (with y_train = np.array(full_df['category'])), and the row that corresponds to the new text as test.
And then you can run
forest = RandomForestClassifier(n_estimators = 100)
forest = forest.fit(X_train, y_train)
y_pred = forest.predict(test)
Of course, in this solution, you have already defined your parameters and you consider your model robust on new data.
Another remark is that if you have a stream of new texts as input that you would like to analyze, this solution would be dreadful since the computational time of computing a new vectorizer.fit_transform(full_added) would increase dramatically.
I hope it helps.
My first implementation of Naive Bayes was from the TextBlob library. It was extremely slow and my machine eventually ran out of memory.
The second try was based on this article http://zacstewart.com/2015/04/28/document-classification-with-scikit-learn.html and used MultinomialNB from the sklearn.naive_bayes module. And it worked like a charm:
#initialize vectorizer
count_vectorizer = CountVectorizer(analyzer = "word",
                                   tokenizer = None,
                                   preprocessor = None,
                                   stop_words = None,
                                   max_features = 5000)
counts = count_vectorizer.fit_transform(df['processed full text'].values)
targets = df['category'].values
#divide into train and test sets
X_train, X_test, y_train, y_test = train_test_split(counts, targets, test_size=0.33)
#create classifier
classifier = MultinomialNB()
classifier.fit(X_train, y_train)
#check accuracy
y_pred = classifier.predict(X_test)
accuracy_score(y_true=y_test, y_pred=y_pred)
#check on completely new example
new_counts = count_vectorizer.transform([processed_test_string])
prediction = classifier.predict(new_counts)
prediction
output:
array(['money laundering'],
dtype='<U16')
And the accuracy is around 91%, so more realistic than 99.96%.
Exactly what I wanted. It would also be nice to see the most informative features; I will try to work that out. Thanks everyone.
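On the most informative features: one possible approach, sketched here on a made-up four-document corpus, is to compare the per-class log-probabilities that MultinomialNB stores in feature_log_prob_. Ranking by the difference between the two classes is my own choice of heuristic, not something from the article linked above.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up documents, for illustration only.
docs = [
    "bank transfer offshore shell company",
    "offshore account shell company fraud",
    "football match final score",
    "weather forecast rain tomorrow",
]
labels = ["money laundering", "money laundering", "other", "other"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)
classifier = MultinomialNB().fit(counts, labels)

# Map column index -> term using the fitted vocabulary.
terms = np.array(sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get))

# Log-probability ratio between the two classes for every term
# (rows of feature_log_prob_ follow the order of classifier.classes_).
log_ratio = classifier.feature_log_prob_[0] - classifier.feature_log_prob_[1]
top_for_first_class = terms[np.argsort(log_ratio)[::-1][:3]]
print(classifier.classes_[0], top_for_first_class)
```

Terms with the largest positive ratio are the ones that push a document most strongly towards the first class in classifier.classes_; sorting in the other direction gives the terms for the second class.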
I am trying to make a script that takes a JSON file (pizza-train.json, from this Kaggle competition). I want to extract the request_text field from each dictionary in the list and construct a bag-of-words representation of the string (string to count list).
The next step is to train a logistic regression classifier to predict the variable “requester_received_pizza”. I want to train on 90% of the data and predict on the remaining 10%. The problem is that I don't know how to predict the 10%. Any advice would be really helpful!
import json
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
f_json = json.load(open('pizza-train.json'))
request_text = []
y = []
for item in f_json[:100]:
    request_text.append(item['request_text'])
    y.append(item['requester_received_pizza'])
vectorizer = CountVectorizer(min_df=1, lowercase=True, stop_words='english')
train_data_features = vectorizer.fit_transform(request_text)
train_data_features = train_data_features.toarray()
print('Shape = ')
print(train_data_features.shape)
vocab = vectorizer.get_feature_names_out()
print('\n')
print('Vocab = ')
print(vocab)
x_train, x_test, y_train, y_test = train_test_split(train_data_features, y, test_size=0.10)
You might do it like this:
from sklearn.linear_model import LogisticRegression
alg = LogisticRegression()
alg.fit(x_train, y_train)
test_score = alg.score(x_test, y_test)
You should read the sklearn docs on logistic regression and cross-validation, which are very good and provide more sophisticated methods for validating your models. This tutorial for the Kaggle Titanic competition might also be useful.
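The score call only reports accuracy. If you also want the predictions themselves for the held-out 10%, predict and predict_proba return them. Below is a minimal sketch on invented request texts (the real pizza-train.json is not used here), with the split done on the raw text so that the vectorizer is fitted on the training part only; that ordering is my assumption, not what the question's code does.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Invented request texts and outcomes, standing in for the JSON file.
texts = (["please send a pizza, I am hungry and broke (%d)" % i for i in range(10)]
         + ["just sharing a story about my day (%d)" % i for i in range(10)])
y = [True] * 10 + [False] * 10

train_texts, test_texts, y_train, y_test = train_test_split(
    texts, y, test_size=0.10, random_state=0)

vectorizer = CountVectorizer(stop_words='english')
x_train = vectorizer.fit_transform(train_texts)  # vocabulary from the 90%
x_test = vectorizer.transform(test_texts)        # same columns for the 10%

alg = LogisticRegression()
alg.fit(x_train, y_train)

predictions = alg.predict(x_test)          # one predicted label per test request
probabilities = alg.predict_proba(x_test)  # one column per class in alg.classes_
print(predictions)
print(probabilities)
```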