This is for a project that's due soon, so help would be greatly appreciated. I've never done ML before, so sorry if the mistake is an absolute smooth-brain one.
I have a dataset that's a bunch of tweets along with personality scores, and I need to train a model to predict the scores.
This is what I've done so far by following a bunch of tutorials and stitching together what I learned.
import pandas
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

train = pandas.read_csv('../dataset/cleaner_dataset.csv')
train['tweet'] = train['tweet'].str.lower()
train['tweet'] = train['tweet'].replace('[^a-zA-Z0-9]', ' ', regex = True)
X = train['tweet']
y = train['neuroticism']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
vectorizer = TfidfVectorizer(min_df=5)
X_vectorized = vectorizer.fit_transform(X_train)
vectorizer = TfidfVectorizer(min_df=5)
X_test_vec = vectorizer.fit_transform(X_train)
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor()
model.fit(X_vectorized, y_train)
model.score(X_test_vec, y_test)
However I'm getting an error on the last line of code when I run it in the notebook.
ValueError: Found input variables with inconsistent numbers of samples: [495, 1980]
Full error message: https://imgur.com/a/GS7jEi5
You are fitting on X_train for both the train and the test matrices, which is why you are getting the error.
Try:
vectorizer = TfidfVectorizer(min_df=5)
X_vectorized = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)  # use the same vectorizer, do not define a new one
As pointed out below, we don't fit the vectorizer on the test set.
BUT you still need to score with X_test_vec and y_test, not with X_train.
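Putting it together, a minimal corrected sketch of the whole pipeline (same file path and column names as your snippet, nothing else changed):
import pandas
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestRegressor

train = pandas.read_csv('../dataset/cleaner_dataset.csv')
train['tweet'] = train['tweet'].str.lower().replace('[^a-zA-Z0-9]', ' ', regex=True)

X_train, X_test, y_train, y_test = train_test_split(
    train['tweet'], train['neuroticism'], test_size=0.2)

vectorizer = TfidfVectorizer(min_df=5)
X_train_vec = vectorizer.fit_transform(X_train)  # fit the vocabulary on train only
X_test_vec = vectorizer.transform(X_test)        # reuse it on the test split

model = RandomForestRegressor()
model.fit(X_train_vec, y_train)
print(model.score(X_test_vec, y_test))  # R^2 on the held-out tweets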
I'm new to this field and I'm currently working with gene expression data. I have to do a classification where my data are counts in matrix form. The features are the genes and the samples to classify are the patients (7 types of cancer and healthy donors). The book from which I'm replicating the experiment says the following:
For the multi-class SVM classification algorithm, a One-Versus-One (OVO) approach was used. To cross validate the algorithm for all samples in the training cohort, the SVM algorithm was trained by all samples in the training cohort minus one, while the remaining sample was used for (blind) classification. This process was repeated for all samples until each sample was predicted once (leave-one-out cross-validation [LOOCV] procedure).
Now I actually know how to use LOOCV in Python, and I know how to use OVO from looking online, but I don't get what is meant to be done here. I made an attempt and the results came out quite similar, but I'm pretty sure I'm making a horrible mistake somewhere. Please don't flame me, I need help. Here below is my interpretation (I copied this from the internet and added OVO instead of only SVM):
from sklearn.model_selection import LeaveOneOut
from sklearn.svm import SVC
from sklearn.multiclass import OneVsOneClassifier

# Function for training
def loocv(train_X, train_y):
    # define X and y
    X = train_X
    y = train_y
    # define LOOCV
    loo = LeaveOneOut()
    loo.get_n_splits(X)
    # define true and predicted label lists
    y_true, y_pred = [], []
    # run
    for train_index, test_index in loo.split(X):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        model = SVC(kernel='linear', random_state=0)
        ovo_classifier = OneVsOneClassifier(model)
        ovo_classifier.fit(X_train, y_train)
        yhat = ovo_classifier.predict(X_test)
        y_true.append(y_test[0])
        y_pred.append(yhat[0])
    return y_true, y_pred, ovo_classifier
Validation:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
y_true, y_pred, model = loocv(X_train, y_train)
pred_y = model.predict(X_test)
training_accuracy = accuracy_score(y_true, y_pred)
accuracy = accuracy_score(y_test, pred_y)
print(accuracy)
print(training_accuracy)
Results :
0.6918604651162791
0.6658291457286432
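For what it's worth, the book's procedure (train on all samples minus one, predict the held-out one, repeat for every sample) can be written much more concisely with cross_val_predict. A sketch, assuming X and y are the full training-cohort arrays:
from sklearn.model_selection import cross_val_predict, LeaveOneOut
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Each sample is predicted exactly once by an OVO SVM trained on all the others
ovo = OneVsOneClassifier(SVC(kernel='linear', random_state=0))
y_pred_loocv = cross_val_predict(ovo, X, y, cv=LeaveOneOut())
print(accuracy_score(y, y_pred_loocv))  # LOOCV accuracy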
I am trying to do the machine learning practice problem on heart disease, using the dataset from Kaggle.
Then I tried to split the data into a train set and a test set, and after combining the models into a single function and predicting, this error shows up in the Jupyter notebook.
Here's my code:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

# Split data into X and y
X = df.drop("target", axis=1)
y = df["target"]
Splitting
# Split data into train and test sets
np.random.seed(42)
# Split into train & test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
Prediction function
# Put models in a dictionary
models = {"Logistic Regression": LogisticRegression(),
"KNN": KNeighborsClassifier(),
"Random Forest": RandomForestClassifier()}
# Create a function to fit and score models
def fit_and_score(models, X_train, X_test, y_train, y_test):
    """
    Fits and evaluates given machine learning models.
    models : a dict of different Scikit-Learn machine learning models
    X_train : training data (no labels)
    X_test : testing data (no labels)
    y_train : training labels
    y_test : test labels
    """
    # Set random seed
    np.random.seed(42)
    # Make a dictionary to keep model scores
    model_scores = {}
    # Loop through models
    for name, model in models.items():
        # Fit the model to the data
        model.fit(X_train, y_train)
        # Evaluate the model and append its score to model_scores
        model_scores[name] = model.score(X_test, y_test)
    return model_scores
And when I run this code, the error shows up:
model_scores = fit_and_score(models=models,
                             X_train=X_train,
                             X_test=X_test,
                             y_train=y_train,
                             y_test=y_test)
model_scores
This is the error:
Your X_train, y_train, or both, seem to have entries that are not float numbers.
At some point in the code, try using
X_train = X_train.astype(float)
y_train = y_train.astype(float)
X_test = X_test.astype(float)
y_test = y_test.astype(float)
Either this will work and the error will go away, or one of the conversions will fail, at which point you will need to decide how (or if) you want to encode the data as a float.
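If one of the conversions does fail because a column holds categorical strings, one common option is one-hot encoding before the split. A sketch, assuming df is the raw DataFrame from the question:
import pandas as pd

# One-hot encode any non-numeric columns so every feature becomes numeric
X = pd.get_dummies(df.drop("target", axis=1))
y = df["target"]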
I'm trying to learn the basics of XGBoost and devised a script that splits some data I found on Kaggle about coronavirus outbreaks in China. The code and model work, but for some reason, when I use the model to make a new prediction, I get "ValueError: feature_names mismatch". The new test data is a 2-D array with 2 values, just like the test data, but I still get the error.
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split

train = df[['RegionCode','ProvinceCode']].astype(int)
test = df['infected'].astype(int)
X_test, X_train, y_test, y_train = train_test_split(train, test, test_size=0.2, random_state=42)
train = xgb.DMatrix(X_train, label=y_train)
test = xgb.DMatrix(X_test, label=y_test)
param = {
    'max_depth': 4,
    'eta': 0.3,
    'num_class': 2}
epochs = 10
model = xgb.train(param, train, epochs)
All the code above works, but the test below gives me the error:
testArray=np.array([[13, 67]])
test_individual = xgb.DMatrix(testArray)
print(model.predict(test_individual))
Any idea what I'm doing wrong?
Seems like you are missing out on the basics of using sklearn's train_test_split function.
X_test, X_train, y_test, y_train = train_test_split(train, test, test_size=0.2, random_state=42)
train_test_split returns the pieces in the order X_train, X_test, y_train, y_test, but the line above unpacks them as X_test, X_train, y_test, y_train, so your train and test sets end up swapped.
Try fixing that first.
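The corrected unpacking, keeping the rest of the code unchanged, would be:
# train_test_split returns the train pieces first, then the test pieces
X_train, X_test, y_train, y_test = train_test_split(train, test, test_size=0.2, random_state=42)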
I'm trying to train a logistic classifier. My dataset has the following columns.
name, review, rating, reviews_cleaned, word_count, sentiment
The sentiment is either +1 or -1 depending on whether the rating is greater than 3 or not. word_count contains a dict of words with their occurrences, and reviews_cleaned is the review text stripped of punctuation.
This is my code to train a LogisticClassifier.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

train_data, test_data = train_test_split(products, test_size = 0.2)
sentiment_model = LogisticRegression(penalty='l2', C=1)
sentiment_model.fit(products['sentiment'], products['word_count'])
I get the following error,
ValueError: Found input variables with inconsistent numbers of samples: [1, 166752]
PS: The equivalent statement using GraphLab Create is
sentiment_model = graphlab.logistic_classifier.create(train_data,
                                                      target = 'sentiment',
                                                      features = ['word_count'],
                                                      validation_set = None)
What am I doing wrong?
Your training data looks like it's a 1-dimensional vector, but sklearn requires it to be 2-dimensional; if you reshape it you should be okay. Note also that fit expects fit(X, y) with the features first, whereas you have passed sentiment (the target) as the first argument. Finally, you make your train/test split but never actually use the data you produce (fit with train_data instead of products).
Using GraphLab in that course is very irritating to say the least. Give this a whirl:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
df = pd.read_csv('amazon_baby.csv', header = 0)
df.dropna(how="any", inplace= True)
products = df[df['rating'] != 3] #drop the products with 3-star rating
products['sentiment'] = products['rating'] >= 4
X_train, X_test, y_train, y_test = train_test_split(products['review'], products['sentiment'], test_size = .2 ,random_state = 0)
vect = CountVectorizer()
X_train = vect.fit_transform(X_train.values)
X_test = vect.transform(X_test.values)
model = LogisticRegression(penalty ='l2', C = 1)
model.fit(X_train, y_train)
I'm not sure what the direct translation between Sklearn/Pandas and GraphLab is, but this looks like it's what they are doing.
When I score the model, I get:
model.score(X_test, y_test)
> .93155480
Let me know what results you get or if this works for you.
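To score a brand-new review with this pipeline, reuse the already-fitted vect rather than fitting a new one. A quick sketch (the review string is just a made-up example):
new_review = ["great product, my baby loves it"]  # hypothetical example text
print(model.predict(vect.transform(new_review)))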
This is my first attempt at document classification with ML and Python.
I first query my database to extract 5000 articles related to money laundering and convert them to a pandas df.
Then I extract 500 articles not related to money laundering and also convert them to a pandas df.
I concatenate both dfs and label them either 'money laundering' or 'other'.
I do preprocessing (removing punctuation and stopwords, lowercasing, etc.)
and then feed the model based on the bag-of-words principle, as below:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

vectorizer = CountVectorizer(analyzer = "word",
                             tokenizer = None,
                             preprocessor = None,
                             stop_words = None,
                             max_features = 5000)
text_features = vectorizer.fit_transform(full_df["processed full text"])
text_features = text_features.toarray()
labels = np.array(full_df['category'])
X_train, X_test, y_train, y_test = train_test_split(text_features, labels, test_size=0.33)
forest = RandomForestClassifier(n_estimators = 100)
forest = forest.fit(X_train, y_train)
y_pred = forest.predict(X_test)
accuracy_score(y_pred=y_pred, y_true=y_test)
It works fine up to this point, even though it gives me a suspiciously high accuracy of 99%. But now I would like to test it on a completely new text document. If I vectorize it and do forest.predict(test), it obviously says:
ValueError: Number of features of the model must match the input. Model n_features is 5000 and input n_features is 45
I am not sure how to overcome this so that I can classify a totally new article.
First of all, even though my proposition may work, I strongly emphasize that this solution has some statistical and computational consequences that you need to understand before running this code.
Let's assume you have an initial corpus of texts full_df["processed full text"] and that test is the new text you would like to classify.
Then define full_added as the corpus of texts made of full_df plus test, and compute:
text_features = vectorizer.fit_transform(full_added)
text_features = text_features.toarray()
You could then use the rows of text_features corresponding to full_df as your train set (X_train = text_features[:len(full_df)]) with y_train = np.array(full_df['category']).
And then you can run
forest = RandomForestClassifier(n_estimators = 100)
forest = forest.fit(X_train, y_train)
y_pred = forest.predict(text_features[len(full_df):])  # the vectorized new text
Of course, in this solution, you have already defined your parameters and you consider your model robust on new data.
Another remark is that if you have a stream of new texts as input that you would like to analyze, this solution would be dreadful since the computational time of computing a new vectorizer.fit_transform(full_added) would increase dramatically.
I hope it helps.
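For completeness, the more standard way around the original error, which also avoids the refitting cost just mentioned, is to keep the vectorizer fitted on the training corpus only and call transform on new text. A minimal sketch, assuming vectorizer and forest are the fitted objects from the question and new_article_text is a stand-in for your preprocessed new article:
# transform produces the same 5000 columns the forest was trained on,
# no matter how short the new document is
new_doc_vec = vectorizer.transform([new_article_text]).toarray()
print(forest.predict(new_doc_vec))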
My first implementation of Naive Bayes was with the TextBlob library. It was extremely slow and my machine eventually ran out of memory.
The second try was based on this article http://zacstewart.com/2015/04/28/document-classification-with-scikit-learn.html and used MultinomialNB from sklearn.naive_bayes. And it worked like a charm:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# initialize vectorizer
count_vectorizer = CountVectorizer(analyzer = "word",
                                   tokenizer = None,
                                   preprocessor = None,
                                   stop_words = None,
                                   max_features = 5000)
counts = count_vectorizer.fit_transform(df['processed full text'].values)
targets = df['category'].values

# divide into train and test sets
X_train, X_test, y_train, y_test = train_test_split(counts, targets, test_size=0.33)

# create classifier
classifier = MultinomialNB()
classifier.fit(X_train, y_train)

# check accuracy
y_pred = classifier.predict(X_test)
accuracy_score(y_true=y_test, y_pred=y_pred)

# check on a completely new example
new_counts = count_vectorizer.transform([processed_test_string])
prediction = classifier.predict(new_counts)
prediction
output:
array(['money laundering'],
dtype='<U16')
And the accuracy is around 91%, so more realistic than 99.96%.
Exactly what I wanted. It would also be nice to see the most informative features; I will try to work that out. Thanks everyone.
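As a starting point for that, here is a sketch of pulling the most indicative words per class out of the fitted MultinomialNB via its feature_log_prob_ attribute (assumes classifier and count_vectorizer from the snippet above; on older scikit-learn versions, get_feature_names() replaces get_feature_names_out()):
import numpy as np

feature_names = count_vectorizer.get_feature_names_out()
for i, label in enumerate(classifier.classes_):
    # indices of the 10 words with the highest per-class log-probability
    top = np.argsort(classifier.feature_log_prob_[i])[-10:]
    print(label, [feature_names[j] for j in top])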