Inconsistent numbers of samples error from Python

I'm working on the Titanic competition in the Spyder IDE. The code is barely complete, but I'm doing it one step at a time (and this is the first time I've ever built a learning model). Now I'm getting a "Found input variables with inconsistent numbers of samples: [891, 183]" error in the log while trying to run my code. This is what I have so far:
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
train_path = "C:\\Users\\Omar\\Downloads\\Titanic Data\\train.csv"
train_data = pd.read_csv(train_path)
columns_of_interest = ['Survived','Pclass', 'Sex', 'Age']
filtered_titanic_data = train_data.dropna(axis=0)
x = train_data[columns_of_interest]
y = filtered_titanic_data.Survived
train_x, val_x, train_y, val_y = train_test_split(x, y, random_state=0)
titanic_model = DecisionTreeRegressor()
titanic_model.fit(train_x, train_y)
val_predictions = titanic_model.predict(val_x)
print(filtered_titanic_data)
I don't know whether it's coming from the CSV file or the parameters. I'm sorry if this is a simple question; I couldn't get other people's solutions to work.

The error occurs because you are taking the labels (y) from the filtered data but x from the unfiltered data, so they have different numbers of rows: 891 before dropna and 183 after, exactly the two numbers in the error message.
Change the following line:
x = train_data[columns_of_interest]
to:
x = filtered_titanic_data[columns_of_interest]
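For completeness, here is a minimal corrected sketch (paths shortened; everything else follows the original script). Two further points worth noting: 'Survived' is the value being predicted, so it should not also appear among the features, and 'Sex' is a string column that needs encoding before DecisionTreeRegressor can fit on it; mapping it to 0/1 is one minimal option:
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

train_data = pd.read_csv("train.csv")
columns_of_interest = ['Pclass', 'Sex', 'Age']  # 'Survived' is the target, not a feature

# drop rows with missing values once, then take x and y from the SAME frame
filtered_titanic_data = train_data.dropna(axis=0)
x = filtered_titanic_data[columns_of_interest].copy()
x['Sex'] = (x['Sex'] == 'male').astype(int)  # encode the string column as 0/1
y = filtered_titanic_data.Survived

train_x, val_x, train_y, val_y = train_test_split(x, y, random_state=0)
titanic_model = DecisionTreeRegressor()
titanic_model.fit(train_x, train_y)
val_predictions = titanic_model.predict(val_x)
print(mean_absolute_error(val_y, val_predictions))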

Related

AttributeError: 'DecisionTreeClassifier' object has no attribute 'precision_score'

I just recently started learning data science. This is what I wrote:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.metrics import precision_score, recall_score
import numpy as np
#reading data
df = pd.read_csv('titanic.csv')
df['male'] = df['Sex'] == 'male'
X = df[['Pclass', 'male', 'Age', 'Siblings/Spouses', 'Parents/Children', 'Fare']].values
y = df['Survived'].values
#spliting data into train/test
kf = KFold(n_splits=5, shuffle=True, random_state=10)
tree_scores = {'accuracy_scores':[],'precision_scores':[],'recall_scores':[]}
logistic_scores = {'accuracy_scores':[],'precision_scores':[],'recall_scores':[]}
#making the models
for train_indexes, test_indexes in kf.split(X):
    X_train, X_test = X[train_indexes], X[test_indexes]
    y_train, y_test = y[train_indexes], y[test_indexes]
    tree = DecisionTreeClassifier()
    tree.fit(X_train, y_train)
    tree_scores['accuracy_scores'].append(tree.score(X_test, y_test))
    tree_prediction = tree.predict(X_test)
    #tree_scores['precision_scores'].append(tree.precision_score(y_test, tree_prediction))
    #tree_scores['recall_scores'].append(tree.recall_score(y_test, tree_prediction))
    logistic = LogisticRegression()
    logistic.fit(X_train, y_train)
    logistic_scores['accuracy_scores'].append(logistic.score(X_test, y_test))
    logistic_prediction = logistic.predict(X_test)
    logistic_scores['precision_scores'].append(precision_score(y_test, logistic_prediction))
    logistic_scores['recall_scores'].append(recall_score(y_test, logistic_prediction))
print("Decision Tree")
print(" accuracy:", np.mean(tree_scores['accuracy_scores']))
print(" precision:", np.mean(tree_scores['precision_scores']))
print(" recall:", np.mean(tree_scores['recall_scores']))
print("Logistic Regression")
print(" accuracy:", np.mean(logistic_scores['accuracy_scores']))
print(" precision:", np.mean(logistic_scores['precision_scores']))
print(" recall:", np.mean(logistic_scores['recall_scores']))
The two lines commented out in the for loop give an error for both precision and recall, and I don't know why. Earlier, when I was running both precision and recall, they worked, and I can't seem to find any spelling mistake either.
I wonder if the different Python syntaxes are messing with sklearn, because once I tried a combination like this:
X = df.loc['Plass':'Fare'].values
y = df.Survived.values
and it gave errors, but when I used the normal expected way it worked fine.
(Note: the code may be wrongly indented; this is my first time using Stack Exchange.)
DecisionTreeClassifier indeed has no such method. precision_score and recall_score are standalone functions in sklearn.metrics (which you already import at the top), not methods on the classifier. You need to change:
tree_scores['precision_scores'].append(tree.precision_score(y_test,tree_prediction))
tree_scores['recall_scores'].append(tree.recall_score(y_test,tree_prediction))
to:
tree_scores['precision_scores'].append(precision_score(y_test,tree_prediction))
tree_scores['recall_scores'].append(recall_score(y_test,tree_prediction))
and you're good to go.
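As a quick sanity check of the API, here is a tiny self-contained example showing that both functions simply take (y_true, y_pred); the values are illustrative:
import numpy as np
from sklearn.metrics import precision_score, recall_score

y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1])
print(precision_score(y_true, y_pred))  # 2 of 3 predicted positives are true -> 0.666...
print(recall_score(y_true, y_pred))     # 2 of 3 actual positives found -> 0.666...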

ValueError: dimension mismatch while predicting new values (sentiment analysis)

I am relatively new to machine learning. I am trying to do sentiment analysis prediction.
The Type column holds the sentiment of each tweet (pos, neg, or neutral, encoded as 0, 1, and 2), and the Tweet column holds the tweets.
I am trying to predict the sentiments of a new set of tweets as 0, 1, and 2.
When I ran the code given here, I got a dimension mismatch error.
import pandas as pd
train_tweets = pd.read_csv("tweets_type.csv")
from sklearn.model_selection import train_test_split
y = train_tweets.Type
X= train_tweets.Tweet
train_X, test_X, train_y, test_y = train_test_split(X, y, random_state=1)
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
vect.fit(train_X)
train_X_dtm = vect.transform(train_X)
test_X_dtm = vect.transform(test_X)
test_X_dtm
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
%time nb.fit(train_X_dtm, train_y)
# make class predictions for X_test_dtm
y_pred_class = nb.predict(test_X_dtm)
# calculate accuracy of class predictions
from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix
metrics.accuracy_score(test_y, y_pred_class)
march_tweets = pd.read_csv("march_data.csv")
X=march_tweets.Tweet
vect.fit(X)
train_new_dtm = vect.transform(X)
new_pred_class = nb.predict(train_new_dtm)
The error I am getting is the ValueError: dimension mismatch from the title, raised on the final predict call. I would be so glad if you could help me.
It seems I made a mistake by fitting the vectorizer on X after I had already fitted it on train_X. There is no point in fitting it again once the model has been trained. So I removed this line and it worked perfectly:
vect.fit(X)
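For context: CountVectorizer's fitted vocabulary is what defines the number of feature columns. Refitting it on the March tweets builds a new vocabulary of a different size, so the resulting matrix no longer lines up with what MultinomialNB was trained on, hence the dimension mismatch. A minimal sketch of the corrected ending, reusing the vect and nb fitted above:
march_tweets = pd.read_csv("march_data.csv")
X = march_tweets.Tweet
# reuse the vectorizer fitted on train_X; do NOT call vect.fit(X) again
new_dtm = vect.transform(X)
new_pred_class = nb.predict(new_dtm)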

RuntimeError: "A pipeline has not yet been optimized. Please call fit() first" with TPOT automated machine learning in Python

When executing sample code, I am encountering the following problem: "RuntimeError: A pipeline has not yet been optimized. Please call fit() first." The problem occurs with TPOT automated machine learning in Python.
I am trying to reproduce the example Dataset 2: Mushroom Classification (https://towardsdatascience.com/tpot-automated-machine-learning-in-python-4c063b3e5de9).
source code:
https://www.kaggle.com/discdiver/tpot-mushroom-classification-task/
I tried changing the position of tpot.fit(X_train, y_train), but it doesn't solve the problem.
Library
import time
import gc
import pandas as pd
import numpy as np
import seaborn as sns
import timeit
import plotly.offline as py
import plotly.graph_objs as go
py.init_notebook_mode(connected=True)
import matplotlib.pyplot as plt
%matplotlib inline
sns.set(font_scale=1.5, palette="colorblind")
import category_encoders
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier
# Read data
df_cogumelo = pd.read_csv('agaricus-lepiota.csv')
# Visualization
pd.options.display.max_columns = 200
pd.options.display.width = 200
# separate out X
X = df_cogumelo.reindex(columns=[x for x in df_cogumelo.columns.values if x != 'class'])
X = X.apply(LabelEncoder().fit_transform)
# separate out y
y = df_cogumelo.reindex(columns=['class'])
print(y['class'].value_counts())
y = np.ravel(y) # flatten the y array
y = LabelEncoder().fit_transform(y)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75, test_size=0.25, random_state=10)
print(X_train.describe())
print("\n\n\n")
print(X_train.info())
# generation and population_size determine how many populations are made.
tpot = TPOTClassifier(verbosity=3,
                      scoring="accuracy",
                      random_state=10,
                      periodic_checkpoint_folder="tpot_mushroom_results",
                      n_jobs=-1,
                      generations=2,
                      population_size=10,
                      use_dask=True)
times = []
scores = []
winning_pipes = []
# run several fits
for x in range(10):
    start_time = timeit.default_timer()
    tpot.fit(X_train, y_train)
    elapsed = timeit.default_timer() - start_time
    times.append(elapsed)
    winning_pipes.append(tpot.fitted_pipeline_)
    scores.append(tpot.score(X_test, y_test))
    tpot.export('tpot_mushroom.py')
# output results
times = [time/60 for time in times]
print('Times:', times)
print('Scores:', scores)
print('Winning pipelines:', winning_pipes)
# The expected result is as follows:
# https://www.kaggle.com/discdiver/tpot-mushroom-classification-task/
Removing "use_dask=True" solved the error for me.
Your problem is not the code, it is your data. That mushroom dataset has no header row. Go into the file, insert a new first row, and label the columns (it doesn't matter what), making sure the column holding the labels is named 'class' (lowercase c; in the original UCI file the label is the first column). That should fix the problem. If you look at your output, when you print the y['class'] counts you get None. If you have already added the labels correctly, then please post the output stack trace.
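If you would rather not edit the CSV by hand, here is a minimal sketch of supplying the names at load time in pandas, assuming your copy of the file really has no header row (the attr_* names are placeholders; only 'class' has to match what the script expects):
import pandas as pd

# the raw UCI file ships without a header row: supply column names ourselves,
# with the label column called 'class' so the rest of the script can find it
names = ['class'] + ['attr_%02d' % i for i in range(1, 23)]  # 22 attribute columns
df_cogumelo = pd.read_csv('agaricus-lepiota.csv', header=None, names=names)
print(df_cogumelo['class'].value_counts())  # should now print real counts, not None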

How can I forecast a y-variable based on multiple x-variables?

I'm testing code like this.
from sklearn.ensemble import RandomForestClassifier
from sklearn import datasets
import numpy as np
import pandas as pd  # needed for pd.read_csv below
import matplotlib.pyplot as plt
from tabulate import tabulate
# Seaborn for easier visualization
import seaborn as sns
# Load data
df = pd.read_csv('C:\\path_to_file\\train.csv')
df.shape
list(df)
# the model can only handle numeric values so filter out the rest
# data = df.select_dtypes(include=[np.number]).interpolate().dropna()
df1 = df.select_dtypes(include=[np.number])
df1.shape
list(df1)
df1.dtypes
df1 = df1.fillna(0)
#Prerequisites
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
#Split train/test sets
# y = df1.SalePrice
X = df1.drop(['index'], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=.33)
# Train model
clf = RandomForestRegressor(n_jobs=2, n_estimators=1000)
model = clf.fit(X_train, y_train)
# Feature Importance
headers = ['name', 'score']
values = sorted(zip(X_train.columns, model.feature_importances_), key=lambda x: x[1] * -1)
print(tabulate(values, headers, tablefmt='plain'))
(pd.Series(model.feature_importances_, index=X.columns)
.nlargest(10)
.plot(kind='barh'))
This works fine on some sample data that I found online, where the y variable is a sale price. Now, rather than predicting a sales price, I'm trying to figure out how to get the model to make a prediction like target = True or target = False, or maybe my approach is wrong.
It's a bit confusing for me because of this line: df1 = df.select_dtypes(include=[np.number]). Only numbers are included, which makes sense for a RandomForestRegressor. I'm just looking for some guidance on how to deal with a non-numeric prediction here.
You are dealing with a classification problem here, with 2 classes (True, False). To get started, take a look at a simple logistic regression model:
https://en.wikipedia.org/wiki/Logistic_regression
Since you are using sklearn, try:
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
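A minimal self-contained sketch of that approach; the toy frame and the 'target' column name are placeholders for your own data and label:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# toy stand-in for df1: numeric features plus a True/False label column
df = pd.DataFrame({'feat_a': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
                   'feat_b': [0.5, 0.1, 0.9, 0.3, 0.8, 0.2],
                   'target': [True, False, True, False, True, False]})
X = df.drop('target', axis=1)
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33,
                                                    random_state=42, stratify=y)
clf = LogisticRegression()
clf.fit(X_train, y_train)
print(clf.predict(X_test))  # True/False predictions for the held-out rows
Once the target is categorical, your RandomForestRegressor would likewise become a RandomForestClassifier, with the rest of the pipeline unchanged.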

Train-test split does not seem to work properly in Python?

I am trying to run a kNN (k-nearest neighbour) algorithm in Python.
The dataset I am using to try and do this is available at the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/wine
Here is the code I am using:
#1. LIBRARIES
import os
import pandas as pd
import numpy as np
print(os.getcwd())  # Prints the working directory
os.chdir('C:\\file_path')  # Provide the path here
#2. VARIABLES
variables = pd.read_csv('wines.csv')
winery = variables['winery']
alcohol = variables['alcohol']
malic = variables['malic']
ash = variables['ash']
ash_alcalinity = variables['ash_alcalinity']
magnesium = variables['magnesium']
phenols = variables['phenols']
flavanoids = variables['flavanoids']
nonflavanoids = variables['nonflavanoids']
proanthocyanins = variables['proanthocyanins']
color_intensity = variables['color_intensity']
hue = variables['hue']
od280 = variables['od280']
proline = variables['proline']
#3. MAX-MIN NORMALIZATION
alcoholscaled=(alcohol-min(alcohol))/(max(alcohol)-min(alcohol))
malicscaled=(malic-min(malic))/(max(malic)-min(malic))
ashscaled=(ash-min(ash))/(max(ash)-min(ash))
ash_alcalinity_scaled=(ash_alcalinity-min(ash_alcalinity))/(max(ash_alcalinity)-min(ash_alcalinity))
magnesiumscaled=(magnesium-min(magnesium))/(max(magnesium)-min(magnesium))
phenolsscaled=(phenols-min(phenols))/(max(phenols)-min(phenols))
flavanoidsscaled=(flavanoids-min(flavanoids))/(max(flavanoids)-min(flavanoids))
nonflavanoidsscaled=(nonflavanoids-min(nonflavanoids))/(max(nonflavanoids)-min(nonflavanoids))
proanthocyaninsscaled=(proanthocyanins-min(proanthocyanins))/(max(proanthocyanins)-min(proanthocyanins))
color_intensity_scaled=(color_intensity-min(color_intensity))/(max(color_intensity)-min(color_intensity))
huescaled=(hue-min(hue))/(max(hue)-min(hue))
od280scaled=(od280-min(od280))/(max(od280)-min(od280))
prolinescaled=(proline-min(proline))/(max(proline)-min(proline))
alcoholscaled.mean()
alcoholscaled.median()
alcoholscaled.min()
alcoholscaled.max()
#4. DATA FRAME
d = {'alcoholscaled': pd.Series([alcoholscaled]),
     'malicscaled': pd.Series([malicscaled]),
     'ashscaled': pd.Series([ashscaled]),
     'ash_alcalinity_scaled': pd.Series([ash_alcalinity_scaled]),
     'magnesiumscaled': pd.Series([magnesiumscaled]),
     'phenolsscaled': pd.Series([phenolsscaled]),
     'flavanoidsscaled': pd.Series([flavanoidsscaled]),
     'nonflavanoidsscaled': pd.Series([nonflavanoidsscaled]),
     'proanthocyaninsscaled': pd.Series([proanthocyaninsscaled]),
     'color_intensity_scaled': pd.Series([color_intensity_scaled]),
     'hue_scaled': pd.Series([huescaled]),
     'od280scaled': pd.Series([od280scaled]),
     'prolinescaled': pd.Series([prolinescaled])}
df = pd.DataFrame(d)
#5. TRAIN-TEST SPLIT
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(np.matrix(df),np.matrix(winery),test_size=0.3)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
#6. K-NEAREST NEIGHBOUR ALGORITHM
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=10)
knn.fit(X_train, y_train)
print("Test set score: {:.2f}".format(knn.score(X_test, y_test)))
In section 5, when I run train_test_split from sklearn.model_selection, it does not appear to run correctly, because it produces the shapes (0, 13), (0, 178), (1, 13), and (1, 178).
Then, upon trying to run the kNN, I get the error message: Found array with 0 sample(s) (shape=(0, 13)) while a minimum of 1 is required. This is not due to the max-min normalisation, as I still get this error message even when the variables are not scaled.
I'm not exactly sure where your code is going wrong relative to the sklearn docs, but the shapes are a hint: wrapping each scaled column in a list, as in pd.Series([alcoholscaled]), produces a one-element Series, so df ends up with a single row, and a 70/30 split of one row leaves zero training samples. In any case, I can show you a different way of getting the train-test split to work on the wine dataset.
from sklearn.datasets import load_wine
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
X, y = load_wine(return_X_y=True)
X_scaled = MinMaxScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3)
knn = KNeighborsClassifier(n_neighbors=10)
knn.fit(X_train, y_train)
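To mirror the final check in your original script, you can then score on the held-out set:
print("Test set score: {:.2f}".format(knn.score(X_test, y_test)))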
