How to predict data from scikit-learn toy dataset - python

I am studying machine learning and I am trying to analyze the scikit-learn diabetes toy dataset. In this case, I want to change the default Bunch object to a pandas DataFrame object. I tried using the argument as_frame=True and it did actually change the object type to DataFrame.
After that, I trained a model on the data, and the problem appears when I try to plot it:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn import datasets, linear_model
from sklearn.model_selection import train_test_split
dataset = datasets.load_diabetes(as_frame=True)
X = dataset.data
y = dataset.target
y = y.to_frame()
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=42)
regressor = linear_model.LinearRegression()
regressor.fit(X_train, y_train)
plt.scatter(X_train, y_train, color='blue')
plt.plot(X_train, regressor.predict(X_test), color='red')
The problem occurs when I try to plot it with matplotlib, since as_frame=True returns (data, target), where data is a DataFrame and target is a Series.
Traceback (most recent call last):
File "C:/Users/Kelvin/OneDrive/Documents/analytics/diabetes-sklearn/test.py", line 19, in <module>
plt.scatter(X_train, y_train, color='blue')
File "C:\Users\Kelvin\OneDrive\Desktop\analytics\lib\site-packages\matplotlib\pyplot.py", line 3037, in scatter
__ret = gca().scatter(
File "C:\Users\Kelvin\OneDrive\Desktop\analytics\lib\site-packages\matplotlib\__init__.py", line 1352, in inner
return func(ax, *map(sanitize_sequence, args), **kwargs)
File "C:\Users\Kelvin\OneDrive\Desktop\analytics\lib\site-packages\matplotlib\axes\_axes.py", line 4478, in scatter
raise ValueError("x and y must be the same size")
ValueError: x and y must be the same size
So, my question is: is there a way to get the whole dataset as a DataFrame, just like the one we get from pd.read_csv()?

That is already a DataFrame. You are getting the error because you are plotting X_train against y_train, and X_train has multiple columns (ten features), so x and y are not the same size.
But if you want your dataset in a CSV file, you can use this code:
X.to_csv('train_data.csv')
This saves the dataset to a CSV file in your working directory. You can then use pd.read_csv on train_data.csv.
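A minimal sketch of both points, assuming the goal is a plot of one feature against the target: with as_frame=True the returned Bunch also has a frame attribute that holds the features and the target in a single DataFrame, and the plot works once a single column is selected. The choice of the bmi column, the Agg backend, and the output file name are my own assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the script also runs without a display
import matplotlib.pyplot as plt
from sklearn import datasets, linear_model
from sklearn.model_selection import train_test_split

dataset = datasets.load_diabetes(as_frame=True)

# Features and target together in one DataFrame, similar to what pd.read_csv would give.
df = dataset.frame
print(df.head())

# Pick a single feature so that x and y have the same number of values.
X = dataset.data[["bmi"]]  # double brackets keep X two-dimensional for scikit-learn
y = dataset.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=42
)

regressor = linear_model.LinearRegression()
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)

# Plot the test points and the fitted line over the same x values.
fig, ax = plt.subplots()
ax.scatter(X_test["bmi"], y_test, color="blue")
ax.plot(X_test["bmi"], y_pred, color="red")
ax.set_xlabel("bmi")
ax.set_ylabel("target")
fig.savefig("diabetes_bmi.png")  # hypothetical output file name
```

Note that the predictions are plotted against X_test, the same rows they were computed from; plotting them against X_train would fail again because the lengths differ.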

Related

Sklearn can't convert string to float

I'm using Sklearn as a machine learning tool, but every time I run my code, it gives this error:
Traceback (most recent call last):
File "C:\Users\FakeUserMadeUp\Desktop\Python\Machine Learning\MachineLearning.py", line 12, in <module>
model.fit(X_train, Y_train)
File "C:\Users\FakeUserMadeUp\AppData\Roaming\Python\Python37\site-packages\sklearn\tree\_classes.py", line 942, in fit
X_idx_sorted=X_idx_sorted,
File "C:\Users\FakeUserMadeUp\AppData\Roaming\Python\Python37\site-packages\sklearn\tree\_classes.py", line 166, in fit
X, y, validate_separately=(check_X_params, check_y_params)
File "C:\Users\FakeUserMadeUp\AppData\Roaming\Python\Python37\site-packages\sklearn\base.py", line 578, in _validate_data
X = check_array(X, **check_X_params)
File "C:\Users\FakeUserMadeUp\AppData\Roaming\Python\Python37\site-packages\sklearn\utils\validation.py", line 746, in check_array
array = np.asarray(array, order=order, dtype=dtype)
File "C:\Users\FakeUserMadeUp\AppData\Roaming\Python\Python37\site-packages\pandas\core\generic.py", line 1993, in __array__
return np.asarray(self._values, dtype=dtype)
ValueError: could not convert string to float: 'Paris'
Here is the code, and down below there's my dataset:
(I've tried multiple different datasets. Also, this dataset is a txt file because I made it myself and am too dumb to convert it to csv.)
import pandas as pd
from sklearn.tree import DecisionTreeClassifier as dtc
from sklearn.model_selection import train_test_split as tts
city_data = pd.read_csv('TimeZoneTable.txt')
X = city_data.drop(columns=['Country'])
Y = city_data['Country']
X_train, X_test, Y_train, Y_test = tts(X, Y, test_size = 0.2)
model = dtc()
model.fit(X_train, Y_train)
predictions = model.predict(X_test)
print(Y_test)
print(predictions)
Dataset:
CityName,Country,Latitude,Longitude,TimeZone
Moscow,Russia,55.45'N,37.37'E,3
Vienna,Austria,48.13'N,16.22'E,2
Barcelona,Spain,41.23'N,2.11'E,2
Madrid,Spain,40.25'N,3.42'W,2
Lisbon,Portugal,38.44'N,9.09'W,1
London,UK,51.30'N,0.08'W,1
Cardiff,UK,51.29'N,3.11'W,1
Edinburgh,UK,55.57'N,3.11'W,1
Dublin,Ireland,53.21'N,6.16'W,1
Paris,France,48.51'N,2.21'E,2
Machine learning algorithms, and tree-based models such as decision trees and random forests in particular, work exclusively with numeric input. If you want to improve your model, it is often recommended to normalize your data to a range such as [-1, 1] and therefore to use decimal numbers, hence the expectation of a float.
In your case, your dataframe contains string columns (CityName, Latitude and Longitude). As Dilara Gokay said, you first need to transform your strings into floats; for categorical columns, use what is called a one-hot encoder (OneHotEncoder in scikit-learn). You can follow this tutorial if you don't know how to do it.
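As a rough sketch of that conversion, assuming the coordinates can be read as plain decimal numbers with a negative sign for the southern and western hemispheres (that reading of values like 55.45'N is my assumption, as is the helper name to_signed), the dataset from the question can be made fully numeric like this. The city name is one-hot encoded with pandas.get_dummies here, as a stand-in for OneHotEncoder.

```python
import io

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# The dataset from the question, inlined so the example is self-contained.
raw = """CityName,Country,Latitude,Longitude,TimeZone
Moscow,Russia,55.45'N,37.37'E,3
Vienna,Austria,48.13'N,16.22'E,2
Barcelona,Spain,41.23'N,2.11'E,2
Madrid,Spain,40.25'N,3.42'W,2
Lisbon,Portugal,38.44'N,9.09'W,1
London,UK,51.30'N,0.08'W,1
Cardiff,UK,51.29'N,3.11'W,1
Edinburgh,UK,55.57'N,3.11'W,1
Dublin,Ireland,53.21'N,6.16'W,1
Paris,France,48.51'N,2.21'E,2
"""
city_data = pd.read_csv(io.StringIO(raw))


def to_signed(coord):
    """Hypothetical helper: turn "3.42'W" into -3.42 and "55.45'N" into 55.45."""
    value, hemisphere = coord.split("'")
    return -float(value) if hemisphere in ("S", "W") else float(value)


# Numeric columns parsed from the coordinate strings.
X = pd.DataFrame({
    "Latitude": city_data["Latitude"].map(to_signed),
    "Longitude": city_data["Longitude"].map(to_signed),
    "TimeZone": city_data["TimeZone"],
})

# One-hot encode the city name: one 0/1 column per distinct city.
X = X.join(pd.get_dummies(city_data["CityName"], prefix="City", dtype=float))
Y = city_data["Country"]  # string class labels are fine as the target

model = DecisionTreeClassifier()
model.fit(X, Y)  # no "could not convert string to float" error any more
predictions = model.predict(X)
print(predictions)
```

With only ten rows and every city name unique, the one-hot columns carry no generalizable information; they are included only to show the mechanics. Only the features need to be numeric: the Country target can stay as strings.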

Error `MultiLabelBinarizer` when importing strings from a csv into a fit() function to train a model with scikit-learn

import pandas as pd
from sklearn.model_selection import train_test_split
df = pd.read_csv('coords.csv',sep=';') # Load the csv file
x = df.iloc[1:,1:] #features values
y = df.iloc[1:,0] #target value
y = y.apply(lambda y: y.encode())
print(x)
print(y)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=1234)
print(x_train)
print(y_train)
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
pipelines = {
'lr':make_pipeline(StandardScaler(), LogisticRegression()),
'rc':make_pipeline(StandardScaler(), RidgeClassifier()),
'rf':make_pipeline(StandardScaler(), RandomForestClassifier()),
'gb':make_pipeline(StandardScaler(), GradientBoostingClassifier()),
}
fit_models = {}
for algo, pipeline in pipelines.items():
model = pipeline.fit(x_train, y_train)
fit_models[algo] = model
print(fit_models)
print(fit_models['lr'].predict(x_test))
print(fit_models['rc'].predict(x_test))
print(fit_models['rf'].predict(x_test))
print(fit_models['gb'].predict(x_test))
I was having a problem when trying to load strings from a csv file, because it tells me:
Traceback (most recent call last):
File "3_Train_Custom_Model_Using_Scikit_Learn.py", line 99, in <module>
model = pipeline.fit(x_train, y_train)
File "C:\Users\PC0\Anaconda3\lib\site-packages\sklearn\utils\optimize.py", line 243, in _check_optimize_result
).format(solver, result.status, result.message.decode("latin1"))
AttributeError: 'str' object has no attribute 'decode'
And when I add y = y.apply(lambda y: y.encode()), because I thought I needed to convert the strings to bytes, I get this:
Traceback (most recent call last):
File "3_Train_Custom_Model_Using_Scikit_Learn.py", line 99, in <module>
model = pipeline.fit(x_train, y_train)
File "C:\Users\PC0\Anaconda3\lib\site-packages\sklearn\pipeline.py", line 335, in fit
self._final_estimator.fit(Xt, y, **fit_params_last_step)
File "C:\Users\PC0\Anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py", line 1345, in fit
check_classification_targets(y)
File "C:\Users\PC0\Anaconda3\lib\site-packages\sklearn\utils\multiclass.py", line 169, in check_classification_targets
y_type = type_of_target(y)
File "C:\Users\PC0\Anaconda3\lib\site-packages\sklearn\utils\multiclass.py", line 263, in type_of_target
raise ValueError('You appear to be using a legacy multi-label data'
ValueError: You appear to be using a legacy multi-label data representation. Sequence of sequences are no longer supported; use a binary array or sparse matrix instead - the MultiLabelBinarizer transformer can convert to this format.
How do I make it so that the data framed in red in the Excel screenshot of the csv (the targets) is saved in the variable y, and the data framed in blue (the features x1, y1, z1, v1, x2, y2, z2, v2, ..., x501, y501, z501, v501) is saved in the variable x?
Try this:
df = pd.read_csv('testing.csv',sep=';',header=1)
x = df.iloc[:,1:] #features values
y = df.iloc[:,0] #target value
#y = y.apply(lambda y: y.encode())
print(x)
print(y)
...
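To illustrate why the encode() call is unnecessary, here is a small sketch with made-up data: scikit-learn classifiers accept plain string labels in y, so the first column can be passed to fit as it is. The column names, the class names (up, down) and the values are invented stand-ins for the real coords.csv, which is not shown in the question.

```python
import io

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the real file: first column is the class label,
# the remaining columns are numeric features, separated by ';'.
csv_text = """class;x1;y1;z1;v1
up;0.10;0.20;0.30;0.90
up;0.15;0.25;0.28;0.95
up;0.12;0.22;0.31;0.92
down;0.80;0.70;0.60;0.90
down;0.85;0.72;0.65;0.88
down;0.78;0.69;0.58;0.93
"""
df = pd.read_csv(io.StringIO(csv_text), sep=";")

x = df.iloc[:, 1:]  # every column except the first: features
y = df.iloc[:, 0]   # first column: target, kept as ordinary Python strings

model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(x, y)

predictions = model.predict(x)
print(predictions)  # the predictions come back as the original string labels
```

Converting each label to bytes makes every target a sequence in its own right, which appears to be what the legacy multi-label check in the second traceback is reacting to.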

ValueError: Number of labels=25 does not match number of samples=56

I am getting this error:
Traceback (most recent call last):
File "C:/Users/HP/.PyCharmCE2019.1/config/scratches/scratch.py", line 25, in <module>
dtree.fit(x_train,y_train)
File "C:\Users\HP\PycharmProjects\untitled\venv\lib\site-packages\sklearn\tree\tree.py", line 801, in fit
X_idx_sorted=X_idx_sorted)
File "C:\Users\HP\PycharmProjects\untitled\venv\lib\site-packages\sklearn\tree\tree.py", line 236, in fit
"number of samples=%d" % (len(y), n_samples))
ValueError: Number of labels=45 does not match number of samples=36
I am using a DecisionTreeClassifier model, but I am getting this error. Any help would be appreciated.
#importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
#reading the dataset
df=pd.read_csv(r'C:\csv\kyphosis.csv')
print(df)
print(df.head())
#visualising the dataset
print(sns.pairplot(df,hue='Kyphosis',palette='Set1'))
plt.show()
#training and testing
from sklearn.modelselection import traintestsplit
c=df.drop('Kyphosis',axis=1)
d=df['Kyphosis']
xtrain,ytrain,xtest,ytest=traintestsplit(c,d,testsize=0.30)
#Decision_Tree
from sklearn.tree import DecisionTreeClassifier
dtree=DecisionTreeClassifier()
dtree.fit(xtrain,ytrain)
#Predictions
predictions=dtree.predict(xtest)
from sklearn.metrics import classificationreport,confusionmatrix
print(classificationreport(ytest,predictions))
print(confusionmatrix(y_test,predictions))
Expected result should be my classification_report and confusion_matrix
So, the error is thrown by the function dtree.fit(xtrain, ytrain), because xtrain and ytrain have unequal length.
Checking the part of code that generates it:
xtrain,ytrain,xtest,ytest=traintestsplit(c,d,testsize=0.30)
and comparing to the example in the documentation
import numpy as np
from sklearn.model_selection import train_test_split
[...]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
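you can see two things:
1 traintestsplit should be train_test_split (likewise, the module is sklearn.model_selection and the keyword argument is test_size)
2 by changing the order of the variables to the left of the =, you assign different data to these variables: the function returns the train and test parts of the first array, followed by the train and test parts of the second.
So your code should be:
xtrain, xtest, ytrain, ytest = train_test_split(c, d, test_size=0.30)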
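The effect of the unpacking order can be checked on a small made-up array. This is only a sketch with toy data (ten samples, two features), not the kyphosis dataset from the question.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples with 2 features each, and one label per sample.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Correct order: both parts of X first, then both parts of y.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0
)
print(len(X_train), len(y_train))  # same length, so fit(X_train, y_train) works

# Order used in the question: the second name receives the test part of X,
# so the "labels" passed to fit are really a slice of the features.
xtrain, ytrain, xtest, ytest = train_test_split(
    X, y, test_size=0.30, random_state=0
)
print(len(xtrain), len(ytrain))  # different lengths, hence the ValueError in fit
```

The function itself behaves identically in both calls; only the names the four returned pieces are bound to change.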

matplotlib error: x and y must be the same size

How can I fix the 'ValueError: x and y must be the same size' error?
The idea of the code is to apply a multivariate linear regression model to temperature and NO data from different sensors, in order to train the model, see how the results correlate with each other, and see the prediction as a whole.
from sklearn import linear_model
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Names of the files
filename = 'NORM_AC_HAE.csv'
file = 'NORM_NABEL_HAE_lev1.csv'
# Read the data
data=pd.read_csv(filename)
data_other=pd.read_csv(file)
col = ['Aircube.009.0.no.we.aux.ch6', 'Aircube.009.0.sht.temperature.ch1']
X = data.loc[:, col]
Y = data_other.loc[:,'NO.ppb']
# Fitting the Linear Regression to the training set
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.3, train_size = 0.6, random_state = np.random.seed(0))
mlr = LinearRegression()
mlr.fit(X_train, y_train)
# Visualization of the test set results
plt.figure(2)
plt.scatter(y_test, X_test) #The VALUE ERROR appears here
The Error Code is:
Traceback (most recent call last):
File "C:\Users\andre\Desktop\UV\4o\TFG\EMPA\dataset_Mila\MLR_no_temp_hae_no.py", line 65, in <module>
plt.scatter(y_test, X_test)
File "C:\Users\andre\AppData\Local\Programs\Python\Python37-32\lib\site-packages\matplotlib\pyplot.py", line 2864, in scatter
is not None else {}), **kwargs)
File "C:\Users\andre\AppData\Local\Programs\Python\Python37-32\lib\site-packages\matplotlib\__init__.py", line 1810, in inner
return func(ax, *args, **kwargs)
File "C:\Users\andre\AppData\Local\Programs\Python\Python37-32\lib\site-packages\matplotlib\axes\_axes.py", line 4182, in scatter
raise ValueError("x and y must be the same size")
ValueError: x and y must be the same size
[Finished in 6.9s]
X_test.shape = [36648 rows x 2 columns]
Both data arguments in plt.scatter (here y_test and X_test) must be 1-dimensional arrays; from the docs:
x, y : array_like, shape (n, )
while here you attempt to pass a 2-dimensional matrix for X_test, hence the error of different size.
You cannot get a scatter plot of a matrix with an array/vector; what you could do is produce two separate scatter plots, one for each column in your X_test:
plt.figure(2)
plt.scatter(y_test, X_test.iloc[:,0].values)
plt.figure(3)
plt.scatter(y_test, X_test.iloc[:,1].values)
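The same idea can be written as a loop that draws one panel per feature column. This is a sketch on randomly generated stand-in data, since the sensor files are not available; the column names no_aux and temperature, the Agg backend, and the output file name are assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the script also runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Random stand-in data with the same layout as in the question:
# a two-column feature frame and a one-dimensional target.
rng = np.random.default_rng(0)
X_test = pd.DataFrame({
    "no_aux": rng.normal(size=50),
    "temperature": rng.normal(size=50),
})
y_test = pd.Series(rng.normal(size=50), name="NO.ppb")

# One subplot per feature column, each pairing y_test with a single column.
fig, axes = plt.subplots(1, X_test.shape[1], figsize=(8, 3))
for ax, column in zip(axes, X_test.columns):
    ax.scatter(y_test, X_test[column].values)
    ax.set_xlabel(y_test.name)
    ax.set_ylabel(column)

fig.tight_layout()
fig.savefig("scatter_per_feature.png")  # hypothetical output file name
```

Each call to scatter now receives two arrays with the same number of values, which is what the error message asks for.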

ValueError: array length does not match index length

I am practicing for contests like kaggle and I have been trying to use XGBoost and am trying to get myself familiar with python 3rd party libraries like pandas and numpy.
I have been reviewing scripts from this particular competition called the Santander Customer Satisfaction Classification and I have been modifying different forked scripts in order to experiment on them.
Here is one modified script through which I am trying to implement XGBoost:
import pandas as pd
from sklearn import cross_validation as cv
import xgboost as xgb
df_train = pd.read_csv("/Users/pavan7vasan/Desktop/Machine_Learning/Project Datasets/Santander_Customer_Satisfaction/train.csv")
df_test = pd.read_csv("/Users/pavan7vasan/Desktop/Machine_Learning/Project Datasets/Santander_Customer_Satisfaction/test.csv")
df_train = df_train.replace(-999999,2)
id_test = df_test['ID']
y_train = df_train['TARGET'].values
X_train = df_train.drop(['ID','TARGET'], axis=1).values
X_test = df_test.drop(['ID'], axis=1).values
X_train, X_test, y_train, y_test = cv.train_test_split(X_train, y_train, random_state=1301, test_size=0.4)
clf = xgb.XGBClassifier(objective='binary:logistic',
missing=9999999999,
max_depth = 7,
n_estimators=200,
learning_rate=0.1,
nthread=4,
subsample=1.0,
colsample_bytree=0.5,
min_child_weight = 3,
reg_alpha=0.01,
seed=7)
clf.fit(X_train, y_train, early_stopping_rounds=50, eval_metric="auc", eval_set=[(X_train, y_train), (X_test, y_test)])
y_pred = clf.predict_proba(X_test)
print("Cross validating and checking the score...")
scores = cv.cross_val_score(clf, X_train, y_train)
'''
test = []
result = []
for each in id_test:
test.append(each)
for each in y_pred[:,1]:
result.append(each)
print len(test)
print len(result)
'''
submission = pd.DataFrame({"ID":id_test, "TARGET":y_pred[:,1]})
#submission = pd.DataFrame({"ID":test, "TARGET":result})
submission.to_csv("submission_XGB_Pavan.csv", index=False)
Here is the stacktrace :
Traceback (most recent call last):
File "/Users/pavan7vasan/Documents/workspace/Machine_Learning_Project/Kaggle/XG_Boost.py", line 45, in <module>
submission = pd.DataFrame({"ID":id_test, "TARGET":y_pred[:,1]})
File "/anaconda/lib/python2.7/site-packages/pandas/core/frame.py", line 214, in __init__
mgr = self._init_dict(data, index, columns, dtype=dtype)
File "/anaconda/lib/python2.7/site-packages/pandas/core/frame.py", line 341, in _init_dict
dtype=dtype)
File "/anaconda/lib/python2.7/site-packages/pandas/core/frame.py", line 4798, in _arrays_to_mgr
index = extract_index(arrays)
File "/anaconda/lib/python2.7/site-packages/pandas/core/frame.py", line 4856, in extract_index
raise ValueError(msg)
ValueError: array length 30408 does not match index length 75818
I have tried the solutions I found by searching, but I cannot figure out what the mistake is. What have I done wrong? Please let me know.
The problem is that you are defining X_test twice, as @maxymoo mentioned. First you define it as
X_test = df_test.drop(['ID'], axis=1).values
And then you redefine it with:
X_train, X_test, y_train, y_test = cv.train_test_split(X_train, y_train, random_state=1301, test_size=0.4)
This means X_test now has a size equal to 0.4*len(X_train). Then, after:
y_pred = clf.predict_proba(X_test)
you have predictions for that part of X_train, and you are trying to create a DataFrame from them together with the initial id_test, which has the length of the original X_test.
You could use X_fit and X_eval in train_test_split so that you do not overwrite the initial X_train and X_test. Otherwise your cross-validation also runs on a different (reduced) X_train, which means you will not get the right answer, or your CV score will be inaccurate compared with the public/private score.
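A sketch of that renaming on synthetic data, assuming a current scikit-learn where train_test_split lives in sklearn.model_selection. LogisticRegression is used here only as a stand-in for XGBClassifier so the example runs without xgboost, and the feature names f1 and f2 and the generated values are invented.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for train.csv and test.csv.
rng = np.random.default_rng(1301)
df_train = pd.DataFrame({
    "ID": range(100),
    "f1": rng.normal(size=100),
    "f2": rng.normal(size=100),
})
df_train["TARGET"] = (df_train["f1"] + df_train["f2"] > 0).astype(int)
df_test = pd.DataFrame({
    "ID": range(100, 140),
    "f1": rng.normal(size=40),
    "f2": rng.normal(size=40),
})

id_test = df_test["ID"]
y_train = df_train["TARGET"].values
X_train = df_train.drop(["ID", "TARGET"], axis=1).values
X_test = df_test.drop(["ID"], axis=1).values

# Split the training data under new names so X_test is left untouched.
X_fit, X_eval, y_fit, y_eval = train_test_split(
    X_train, y_train, random_state=1301, test_size=0.4
)

clf = LogisticRegression()  # stand-in for xgb.XGBClassifier
clf.fit(X_fit, y_fit)
print("held-out accuracy:", clf.score(X_eval, y_eval))

# Predict on the real test set, which still has one row per ID in id_test.
y_pred = clf.predict_proba(X_test)
submission = pd.DataFrame({"ID": id_test, "TARGET": y_pred[:, 1]})
print(submission.shape)
```

Because X_test is never reassigned, the prediction array and id_test have the same length and the DataFrame constructor no longer complains.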
