I'm fairly new to Python and ML. I have a simple table that contains a date column and a float value. I want to predict future sales for a given period, say 2022-01. I managed to obtain a prediction based on my data, but the number of predicted values is equal to the number of training values. Also, isn't the mean squared error value too high? So far, I have the following:
import pandas as pd
import numpy as np
import datetime
df=pd.read_csv(r"Sale.csv")
#Break the date column into multiple columns
df["Data"]=pd.to_datetime(df["Data"])
df["Data"]=df["Data"].dt.strftime("%d.%m.%Y")
df["Year"]=pd.DatetimeIndex(df["Data"]).year
df["Month"]=pd.DatetimeIndex(df["Data"]).month
df["Day"]=pd.DatetimeIndex(df["Data"]).day
df["Weekday"]=pd.DatetimeIndex(df["Data"]).weekday
df["Dayofyear"]=pd.DatetimeIndex(df["Data"]).dayofyear
df=df.drop(["Data"],axis=1) #drop initial column
## Dummy Encoding
df = pd.get_dummies(df, columns=['Year'], drop_first=False, prefix='Year')
df = pd.get_dummies(df, columns=['Month'], drop_first=True, prefix='Month')
df = pd.get_dummies(df, columns=['Weekday'], drop_first=True, prefix='Weekday')
##split Train and test data
train=df.sample(frac=0.8,random_state=200)
test=df.drop(train.index)
target_column_train=['Sales']
predictors_train= list(set(list(train.columns))-set(target_column_train))
X_train=train[predictors_train].values
y_train=train[target_column_train].values
##Loading ML model
from sklearn import model_selection
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
model_rf = RandomForestRegressor(n_estimators=5000, oob_score=True, random_state=100)
model_rf.fit(X_train, y_train.ravel()) #.ravel will convert the array shape to (n, )
pred_train_rf= model_rf.predict(X_train)
print("RMSE:")
print(np.sqrt(mean_squared_error(y_train,pred_train_rf)))
# 7956042.545725489
print ("\n r2_score(Coefficient of determination:) is : ")
print(r2_score(y_train, pred_train_rf))
# 0.9284689685103222
When you run model_rf.predict you are running it on your X_train rather than your test set - that's why the number of prediction values equals the number of training values. You want to fit your model on your train data, and predict on your test data.
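A minimal sketch continuing from your code (it reuses the test split you already created and assumes Sales is present in it):
# Build the test matrices from the held-out rows, using the same columns as training
X_test = test[predictors_train].values
y_test = test[target_column_train].values
# Predict on unseen data and evaluate
pred_test_rf = model_rf.predict(X_test)
print("Test RMSE:", np.sqrt(mean_squared_error(y_test, pred_test_rf)))
print("Test r2_score:", r2_score(y_test, pred_test_rf))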
I'm having a problem with sklearn.
When I train it with .fit() it shows me the error "ValueError: could not convert string to float: 'Casado'".
This is my code:
"""
from sklearn.naive_bayes import GaussianNB
import pandas as pd
# 1. Create Naive Bayes classifier:
gaunb = GaussianNB()
# 2. Create dataset:
dataset = pd.read_csv("archivos_de_datos/Datos_Historicos_Clientes.csv")
X_train = dataset.drop(["Compra"], axis=1) #Here I removed the last column "Compra"
Y_train = dataset["Compra"] #This one only consists of that column "Compra"
print("X_train: ","\n", X_train)
print("Y_train: ","\n", Y_train)
dataset2 = pd.read_csv("archivos_de_datos/Nuevos_Clientes.csv")
X_test = dataset2.drop("Compra", axis=1)
print("X_test: ","\n", X_test)
# 3. Train classifier with dataset:
gaunb = gaunb.fit(X_train, Y_train) #Here shows "ValueError: could not convert string to float: 'Casado'"
# 4. Predict using classifier:
prediction = gaunb.predict(X_test)
print("PREDICTION: ",prediction)
"""
And the dataset I'm using is an .csv file that looks like this (but with more rows):
IdCliente,EstadoCivil,Profesion,Universitario,TieneVehiculo,Compra
1,Casado,Empresario,Si,No,No
2,Casado,Empresario,Si,Si,No
3,Soltero,Empresario,Si,No,Si
I'm trying to train it to determine (with a test dataset) whether the last column would be a Yes or No (Si or No).
I appreciate your help; I'm obviously new at this and I don't understand what I am doing wrong here.
I would use OneHotEncoder to, like Lavin mentioned, turn the yes or no into a numerical value. A model such as this can't process categorical data.
OneHotEncoder is used to handle binary data such as yes/no or male/female, while LabelEncoder is used for categorical data with more than 2 values, e.g. country names.
It will look something like this; however, you'll have to do this with all categorical data, not just your y column, and use LabelEncoder for columns that are not binary (more than 2 values - for example, perhaps EstadoCivil).
Also, I would suggest removing any independent variables that don't contribute to your model; for instance, client ID sounds like it may not add any value in determining your dependent variable. This is context specific, but something to keep in mind.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [Insert column number for your df])], remainder='passthrough')
X = np.array(ct.fit_transform(X))
See the docs:
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
More info:
https://contactsunny.medium.com/label-encoder-vs-one-hot-encoder-in-machine-learning-3fc273365621#:~:text=What%20one%20hot%20encoding%20does,which%20column%20has%20what%20value.&text=So%2C%20that's%20the%20difference%20between%20Label%20Encoding%20and%20One%20Hot%20Encoding.
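A rough sketch of how that could look for this particular dataset (the column names are taken from the CSV shown in the question; whether you one-hot encode every feature and how you encode the target are assumptions you should adapt):
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.naive_bayes import GaussianNB
dataset = pd.read_csv("archivos_de_datos/Datos_Historicos_Clientes.csv")
X = dataset.drop(["IdCliente", "Compra"], axis=1)  # drop the ID and the target column
y = dataset["Compra"]
# One-hot encode the remaining categorical feature columns;
# sparse_threshold=0 forces a dense array, which GaussianNB needs
ct = ColumnTransformer(
    transformers=[('encoder', OneHotEncoder(),
                   ['EstadoCivil', 'Profesion', 'Universitario', 'TieneVehiculo'])],
    remainder='passthrough', sparse_threshold=0)
X_encoded = ct.fit_transform(X)
# Encode the Si/No target as integers
le = LabelEncoder()
y_encoded = le.fit_transform(y)
gaunb = GaussianNB().fit(X_encoded, y_encoded)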
I have three types of classes (setosa, versicolor, virginica) and also 4 other columns, sepal_length, sepal_width, petal_length and petal_width, with around 150 rows, and each is filled with its own information (so nothing is empty there). I need to predict the type of the class based on the other columns.
This is what I have tried:
import numpy as np
import pandas as pd
df = pd.read_csv("data.csv")
X=df[["sepal_length","sepal_width","petal_length","petal_width"]]
y=df["class"]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.1)
from sklearn.linear_model import LinearRegression
clf=LinearRegression()
clf.fit(y_train, X_train)
clf.predict(y_test)
The code above fails with this error:
ValueError: could not convert string to float: 'virginica'
I need to do this with train and test.
You need to encode your data; in other words, transform each category into a number (int or float).
Map the categories like this:
mapping={'setosa':0,'versicolor':1,'virginica':2}
y = y.map(mapping)
After you train your model, you will get 0, 1 or 2 as a result. Convert it back and you'll have your predictions.
And by the way, if you are predicting a class, you must change your model. LinearRegression() is a numerical predictor; it can only predict numerical values.
Try to use SVC, LogisticRegression or any other classification model instead.
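A minimal sketch along those lines, using LogisticRegression in place of LinearRegression (note that fit takes the features first and the labels second):
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
df = pd.read_csv("data.csv")
X = df[["sepal_length", "sepal_width", "petal_length", "petal_width"]]
# Encode the string labels as integers
mapping = {'setosa': 0, 'versicolor': 1, 'virginica': 2}
y = df["class"].map(mapping)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)  # features first, labels second
pred = clf.predict(X_test)
# Convert the numeric predictions back to class names
inverse_mapping = {v: k for k, v in mapping.items()}
print([inverse_mapping[p] for p in pred])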
I am doing modelling, let's say logistic regression, and need to save the results in a dataframe (the prediction results and a unique ID).
Code for predictions
from sklearn.linear_model import LogisticRegression
lr_clf = LogisticRegression()
lr_clf.fit(X_train, y_train)
predictions=lr_clf.predict(test_data)
I want that, along with the predictions, I should also have a column in the predictions dataframe with a unique identifier from X_train (right now predictions is a numpy array). Let's say the unique ID is the ID column in X_train.
Expected output
predictions ID
11 1000
123 1001
and so on
You can include the unique ID from X_train along with the predictions as below.
#Modelling
from sklearn.linear_model import LogisticRegression
lr_clf = LogisticRegression()
lr_clf.fit(X_train, y_train)
predictions=lr_clf.predict(test_data)
#Add ID along with prediction and save the pandas dataframe
predictions_df=pd.DataFrame(data={"ID":X_train["ID"],"Predictions":predictions})
predictions_df.to_csv("predictions_df.csv",index=False,quoting=3,sep=';')
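One thing to keep in mind: predictions has one row per row of test_data, so the identifiers should come from the same frame you predict on, otherwise the lengths and row order won't line up. A small sketch, assuming test_data also carries the ID column and the model was trained without it:
import pandas as pd
# Keep the IDs aside, predict on the remaining feature columns
ids = test_data["ID"]
predictions = lr_clf.predict(test_data.drop("ID", axis=1))
predictions_df = pd.DataFrame({"ID": ids.values, "Predictions": predictions})
predictions_df.to_csv("predictions_df.csv", index=False, quoting=3, sep=';')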
After reading so many examples with 'inconsistent number of samples' errors, I am still not able to see what is wrong with my code.
In an excel file, sheet 1 contains data. Sheet 2 contains a shortlisted list of variables.
I saved the variables in sheet 2 into an array. And feed it to a Random Forest model to evaluate its impact on a parameter in sheet 1.
But I am getting "Found input variables with inconsistent numbers of samples: [54, 2016]"
54 is the number of variables in sheet 2.
2016 is the number of rows of data in sheet 1.
I am trying to see how these 54 variables impact 'Target' variable in sheet 1.
How should I manipulate my data to make this work?
Many thanks in advance.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score
df = pd.read_excel(r'C:\Users\ngks\Desktop\TP Course\Project Module\ProjectDataSetrev2.xlsx',sheet_name=0)
df2 = pd.read_excel(r'C:\Users\ngks\Desktop\TP Course\Project Module\ProjectDataSetrev2.xlsx',sheet_name=1)
df['DateTime']=pd.to_datetime(df['Time Stamp'], format='%Y-%m-%d %H:%M:%S')
df.set_index(df['DateTime'], inplace=True)
print(len(df2.columns))
allvar = list()
for each_var in df2.columns:
    allvar.append(each_var)
allvar = np.array(allvar)
print(allvar)
target = df['(CUP) Chiller Optimization Plant Efficiency [kW/RT]']
target=target.values.reshape(len(target),1)
allvar_train,allvar_test,target_train,target_test= train_test_split(allvar,target, random_state=0, test_size=0.6)
clf = RandomForestClassifier(n_estimators=10000, random_state=0, n_jobs=-1)
clf.fit(allvar_train, target_train)
for feature in zip(feat_labels, clf.feature_importances_):
    print(feature)
Sheet 1 (saved as df), Sheet 2 (saved as df2), the error log, allvar_train and target_train were attached as screenshots.
Error log 2: Unknown label type: 'continuous'
The issue is with 'train_test_split', where you're only passing the feature column names, not the data. Use the list of columns to get the data from the DataFrame like this:
allvar_train,allvar_test,target_train,target_test= train_test_split(df[allvar],target, random_state=0, test_size=0.6)
You don't necessarily need to convert 'allvar' and 'target' to numpy arrays; they can be used directly in 'train_test_split'.
Note: this issue has nothing to do with Random Forest.
Here is the code that works for me.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score
df = pd.read_excel(r'C:\Users\ngks\Desktop\TP Course\Project Module\ProjectDataSetrev3.xlsx',sheet_name=0)
df2 = pd.read_excel(r'C:\Users\ngks\Desktop\TP Course\Project Module\ProjectDataSetrev3.xlsx',sheet_name=1)
df['DateTime']=pd.to_datetime(df['Time Stamp'], format='%Y-%m-%d %H:%M:%S')
df.set_index(df['DateTime'], inplace=True)
print(len(df2.columns))
allvarlist = list()
for each_var in df2.columns:
    allvarlist.append(each_var)
countvar = len(allvarlist)
allvar = df[allvarlist]
allvar = allvar.values.reshape(len(allvar),countvar)
target = df['(CUP) Chiller Optimization Plant Efficiency [kW/RT]']
target=target.values.reshape(len(target),1)
allvar_train,allvar_test,target_train,target_test= train_test_split(allvar,target, random_state=0, test_size=0.7)
clf = RandomForestRegressor(n_estimators=10000, random_state=0, n_jobs=-1)
#print(allvar_train)
#print(target_train)
clf.fit(allvar_train,np.ravel(target_train))
for feature in zip(allvarlist, clf.feature_importances_):
    print(feature)
importances = clf.feature_importances_
#indices = np.argsort(importances)
plt.figure().set_size_inches(14,16)
plt.barh(range(allvar_train.shape[1]), importances, color="r")
plt.yticks(range(allvar_train.shape[1]),allvarlist)
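If you also want the bars ordered by importance (the commented-out np.argsort line is heading in that direction), a small sketch of that sorting could look like this:
# Sort features by importance so the most influential variables end up at the top
indices = np.argsort(importances)
sorted_labels = [allvarlist[i] for i in indices]
plt.figure().set_size_inches(14, 16)
plt.barh(range(len(indices)), importances[indices], color="r")
plt.yticks(range(len(indices)), sorted_labels)
plt.show()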
In the train.csv data in the Titanic Machine Learning project, some passengers have their age data missing, so pandas fills it in as NaN, and sklearn algorithms won't accept it. I tried dataset.fillna('') but now it turns into an empty string and not a float. Please send help.
https://www.kaggle.com/c/titanic/data
import pandas as pd
from sklearn.cross_validation import train_test_split
dataset = pd.read_csv('train.csv')
#dataset = dataset.fillna()
def preprocess(df):
    from sklearn.preprocessing import LabelEncoder
    processed_df = df.copy()
    le = LabelEncoder()
    done = le.fit_transform(processed_df)
    return done
survival = preprocess(dataset.Survived)
data = dataset.drop('Survived',axis= 1)
data = data.drop('PassengerId',axis=1)
data = data.drop('Embarked',axis = 1)
data = data.drop('Cabin',axis = 1)
data = data.drop('Fare',axis = 1)
data = data.drop('Ticket',axis = 1)
data = data.drop('Name',axis=1)
x_train,x_test,y_train,y_test = train_test_split(data,survival,test_size=0.25,random_state=0)
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn import svm
from sklearn.metrics import accuracy_score
pipeline = make_pipeline(StandardScaler(),
                         svm.SVC(kernel='rbf',C=0.1))
pipeline.fit(x_train,y_train)
print(accuracy_score(pipeline.predict(x_test),y_test))
fillna replaces the NaN values with whatever you write, so if you write '', it will be an empty string. Just write:
dataset = dataset.fillna(0)
If you need to distinguish between 0 and NaN, you can try replacing it with -1; that's what we do.
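For the Titanic Age column specifically, a common variant is to fill with a statistic of the column rather than a constant; a quick sketch:
# Fill missing ages with the mean age instead of a fixed constant
dataset['Age'] = dataset['Age'].fillna(dataset['Age'].mean())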
There are many methods you can use to deal with missing values in a machine learning project:
drop the columns with missing values
drop the rows containing missing values
set the values to some value (zero, the mean, the median, etc.)
For the third option :
Scikit-Learn provides a handy class to take care of missing values: Imputer. Here is how to use it. First, you need to create an Imputer instance, specifying that you want to replace each attribute's missing values with the median of that attribute:
from sklearn.preprocessing import Imputer
imputer = Imputer(strategy="median") # or mean, as you prefer
x_train = imputer.fit_transform(x_train)
x_test = imputer.transform(x_test) # reuse the statistics learned on the training set
The result is a plain NumPy array containing the transformed features. If you want to put it back into a Pandas DataFrame, it's simple.
NB: You could also add the imputer to the pipeline, just before the scaler.
pipeline = make_pipeline(Imputer(strategy="median"),
                         StandardScaler(),
                         svm.SVC(kernel='rbf',C=0.1))
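Note that in recent scikit-learn versions Imputer has been replaced by SimpleImputer (now in sklearn.impute); the idea is the same, so the equivalent pipeline would look roughly like this:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn import svm
# Impute missing values, scale, then classify
pipeline = make_pipeline(SimpleImputer(strategy="median"),  # or "mean"
                         StandardScaler(),
                         svm.SVC(kernel='rbf', C=0.1))
pipeline.fit(x_train, y_train)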