I have made the following Gaussian Naive Bayes model in scikit-learn:
chess_gnb = GaussianNB().fit(raw[['elo', 'opponent_rating', 'winner_loser_elo_diff']],raw['winner'])
I then made a test array and attempted to feed it into the model:
test1 = [['elo', 1000], ['opponent_rating', 800], ['winner_loser_elo_diff', 200]]
chess_gnb.predict(test1)
However, I'm getting this error:
ValueError: Unable to convert array of bytes/strings into decimal numbers with dtype='numeric'
The 'winner' prediction should be a string that can have one of two values. Why am I getting the ValueError if all of my inputs are integers?
You need to provide a DataFrame with the feature columns, not a list of ['name', value] pairs. Using an example:
import pandas as pd
import numpy as np
from sklearn.naive_bayes import GaussianNB
np.random.seed(123)
raw = pd.DataFrame(np.random.uniform(0, 1000, (100, 3)),
                   columns=['elo', 'opponent_rating', 'winner_loser_elo_diff'])
raw['winner'] = np.random.binomial(1,0.5,100)
chess_gnb = GaussianNB().fit(raw[['elo', 'opponent_rating', 'winner_loser_elo_diff']],raw['winner'])
This works:
test1 = pd.DataFrame({'elo': [1000],'opponent_rating':[800],'winner_loser_elo_diff':[200]})
chess_gnb.predict(test1)
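If you would rather not build a DataFrame for every prediction, a plain 2-D array in the training column order also works; a minimal sketch reusing the model fitted above (newer scikit-learn versions will warn that the input has no feature names):

import numpy as np

# Equivalent prediction with a plain 2-D array; the columns must be in the same
# order as during fitting (elo, opponent_rating, winner_loser_elo_diff).
# Recent scikit-learn versions may warn that X has no valid feature names.
chess_gnb.predict(np.array([[1000, 800, 200]]))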
I was writing a Python program to predict the price of a house from a given area:
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error
import pandas as pd
df = pd.read_csv('traindata.csv')
plt.xlabel('area')
plt.ylabel('price')
plt.scatter(df.area,df.price)
reg = linear_model.LinearRegression()
reg.fit(df[['area']], df.price)
reg.predict(33000)
When I was executing the program, it showed:
raise ValueError(
ValueError: Expected 2D array, got scalar array instead:
array=33000.
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
Then I changed the (33000) to ([[33000]]) and it showed
UserWarning: X does not have valid feature names, but LinearRegression was fitted with feature names
  warnings.warn(
Then I changed it to ([['33000']]) and it still showed the same error.
You cannot use ([['33000']]) because then you would be trying to predict with a string value, which doesn't work.
If you are worried about the warning, you can create a DataFrame on the fly, for example:
import pandas as pd
import numpy as np
from sklearn import linear_model
df = pd.DataFrame({'area': np.random.randint(10000, 40000, 100),
                   'price': np.random.uniform(1, 100, 100)})
reg = linear_model.LinearRegression()
reg.fit(df[['area']], df.price)
reg.predict(pd.DataFrame({'area':[33000]}))
array([53.70626723])
But you can see that it's the same as if you do:
reg.predict([[33000]])
/Users/gen/anaconda3/lib/python3.8/site-packages/sklearn/base.py:445: UserWarning: X does not have valid feature names, but LinearRegression was fitted with feature names
warnings.warn(
Out[15]: array([53.70626723])
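If you want to silence the warning without constructing a DataFrame for each call, one option (a sketch, not required; reg_arr is just an illustrative name) is to fit on the underlying NumPy array so the model never stores feature names:

# Fit on the raw array instead of the DataFrame: no feature names are recorded,
# so predicting with a plain nested list no longer triggers the warning.
reg_arr = linear_model.LinearRegression()
reg_arr.fit(df[['area']].values, df.price)
reg_arr.predict([[33000]])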
Here is the problem:
1. Extract just the median_income column from the independent variables (from X_train and X_test).
2. Perform Linear Regression to predict housing values based on median_income.
3. Predict output for the test dataset using the fitted model.
4. Plot the fitted model for the training data as well as for the test data to check whether the fitted model satisfies the test data.
I did a linear regression earlier. Following is the code:
import pandas as pd
import os
os.getcwd()
os.chdir('/Users/saurabhsaha/Documents/PGP-AI:ML-Purdue/New/datasets')
df=pd.read_excel('California_housing.xlsx')
df.total_bedrooms=df.total_bedrooms.fillna(df.total_bedrooms.mean())
x = df.iloc[:,2:8]
y = df.median_house_value
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=.20)
from sklearn.linear_model import LinearRegression
california_model = LinearRegression().fit(x_train,y_train)
california_model.predict(x_test)
Predicted_values = pd.DataFrame(california_model.predict(x_test), columns=['Pred'])
Predicted_values
Final = pd.concat([x_test.reset_index(drop=True), y_test.reset_index(drop=True), Predicted_values], axis=1)
Final['Err_pct'] = abs(Final.median_house_value - Final.Pred) / Final.median_house_value
Here is my dataset- https://docs.google.com/spreadsheets/d/1vYngxWw7tqX8FpwkWB5G7Q9axhe9ipTu/edit?usp=sharing&ouid=114925088866643320785&rtpof=true&sd=true
Following is my code.
x1_train=x_train.median_income
x1_train
x1_train.shape
x1_test=x_test.median_income
x1_test
type(x1_test)
x1_test.shape
from sklearn.linear_model import LinearRegression
california_model_new = LinearRegression().fit(x1_train, y1_train)
I get an error right here, and when I try to reshape my 1-D data into a 2-D array as follows, I can't:
import numpy as np
x1_train= x1_train.reshape(-1, 1)
x1_test = x1_test.reshape(-1, 1)
This is the error I get
AttributeError: 'Series' object has no attribute 'reshape'
I am new to data science, so if you can explain a bit it would be really helpful.
x1_train and x1_test are pandas Series objects, whereas the reshape() method applies to NumPy arrays.
Do this instead:
x1_train = x1_train.to_numpy().reshape(-1, 1)
x1_test = x1_test.to_numpy().reshape(-1, 1)
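With both arrays reshaped to (n_samples, 1), the fit from the question should then go through; a minimal sketch continuing from the variables above (selecting with double brackets, e.g. x_train[['median_income']], would keep the data 2-D from the start and avoid the reshape entirely):

from sklearn.linear_model import LinearRegression

# x1_train and x1_test are now 2-D, shape (n_samples, 1), which fit()/predict() expect
california_model_new = LinearRegression().fit(x1_train, y_train)
pred_income = california_model_new.predict(x1_test)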
I'm fairly new to Python and ML. I have a simple table that contains a date column and a float value. I want to predict future sales for a given period, let's say 2022-01. I managed to obtain a prediction based on my data, but the number of prediction values is equal to the number of trained values. Also, isn't the mean squared error value too high? So far, I have the following:
import pandas as pd
import numpy as np
import datetime
df=pd.read_csv(r"Sale.csv")
# Break date column into multiple columns
df["Data"]=pd.to_datetime(df["Data"])
df["Data"]=df["Data"].dt.strftime("%d.%m.%Y")
df["Year"]=pd.DatetimeIndex(df["Data"]).year
df["Month"]=pd.DatetimeIndex(df["Data"]).month
df["Day"]=pd.DatetimeIndex(df["Data"]).day
df["Weekday"]=pd.DatetimeIndex(df["Data"]).weekday
df["Dayofyear"]=pd.DatetimeIndex(df["Data"]).dayofyear
df=df.drop(["Data"],axis=1) #drop initial column
## Dummy Encoding
df = pd.get_dummies(df, columns=['Year'], drop_first=False, prefix='Year')
df = pd.get_dummies(df, columns=['Month'], drop_first=True, prefix='Month')
df = pd.get_dummies(df, columns=['Weekday'], drop_first=True, prefix='Weekday')
## Split train and test data
train=df.sample(frac=0.8,random_state=200)
test=df.drop(train.index)
target_column_train=['Sales']
predictors_train= list(set(list(train.columns))-set(target_column_train))
X_train=train[predictors_train].values
y_train=train[target_column_train].values
## Load ML model
from sklearn import model_selection
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
model_rf = RandomForestRegressor(n_estimators=5000, oob_score=True, random_state=100)
model_rf.fit(X_train, y_train.ravel()) #.ravel will convert the array shape to (n, )
pred_train_rf= model_rf.predict(X_train)
print("RMSE:")
print(np.sqrt(mean_squared_error(y_train,pred_train_rf)))
# 7956042.545725489
print ("\n r2_score(Coefficient of determination:) is : ")
print(r2_score(y_train, pred_train_rf))
# 0.9284689685103222
When you run model.predict you are running it on your X_train rather than your test set - that's why the number of predictions equals the number of training values. You want to fit your model on your training data and predict on your test data.
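A minimal sketch of that last step, reusing the variable names from the question (the test split is created above but never turned into matrices); predicting a genuinely future month such as 2022-01 would additionally require building a feature row with exactly the same dummy columns as the training frame:

# Build the test matrices the same way as the training ones
X_test = test[predictors_train].values
y_test = test[target_column_train].values

# Evaluate the already-fitted model on the held-out rows
pred_test_rf = model_rf.predict(X_test)
print("Test RMSE:", np.sqrt(mean_squared_error(y_test, pred_test_rf)))
print("Test r2_score:", r2_score(y_test, pred_test_rf))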
I'm trying to plot a confusion matrix for a Neural Network; I already constructed and saved the model. I have 11 labels in my dataset.
I am using this code:
import pandas as pd
import numpy as np
from scipy import stats
from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix
from tensorflow.keras.models import load_model  # assumed import for load_model, since the model is loaded from an .h5 file
rounded_labels = np.argmax(y_test, axis=-1)  # y_test are the test labels; np.argmax gives integer class ids
test_model = load_model('/model.h5')
y_pred = test_model.predict(X_test, steps=1, verbose=0)
rounded_y_pred = np.argmax(y_pred, axis=-1)  # np.argmax gives an integer prediction per sample
When I print rounded_y_pred I see integers from 0 to 10, which seems right because I have eleven labels, but when I try to print the confusion matrix:
cm = confusion_matrix(y_true=rounded_labels, y_pred=rounded_y_pred)
I find this error: TypeError: Singleton array 23 cannot be considered a valid collection.
I really don't know how to fix it. Could someone help me? Thank you so much.
I am attempting to use MultinomialNB from sklearn to classify some data. I have made a sample CSV with some labelled training data, which I want to use to train the model, but I receive the following error message:
ValueError: Expected 2D array, got 1D array instead: array=[0 1 2 2].
I know it is a very small data set but I will eventually add more data once the code is working.
Here is my code:
import numpy as np
import pandas as pd
import array as array
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
data_file = pd.read_csv("CSV_Labels.csv", engine='python')
data_file.tail()
vectorizer = CountVectorizer(stop_words='english')
all_features = vectorizer.fit_transform(data_file.Word)
all_features.shape
x_train = data_file.label
y_train = data_file.Word
x_train.values.reshape(1, -1)
y_train.values.reshape(1, -1)
classifer = MultinomialNB()
classifer.fit(x_train, y_train)
Try this:
x_train = x_train.values.reshape(-1, 1)
y_train = y_train.values.reshape(-1, 1)
NumPy reshape operations are not in place, so the arrays you were passing to the classifier actually still had the old shapes.
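As an aside (a hedged sketch, not part of the fix above): for text classification, the vectorized words built earlier in the question are usually what goes into X, with the label column as y:

# Typical layout: CountVectorizer output as features, label column as target.
# Assumes, as in the question's CSV, that 'Word' holds the text and 'label' the class ids.
X = all_features              # sparse matrix from vectorizer.fit_transform(data_file.Word)
y = data_file.label
clf = MultinomialNB()
clf.fit(X, y)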