I am doing modelling, let's say logistic regression, and need to save the results in a dataframe (prediction results and a unique ID).
Code for predictions
from sklearn.linear_model import LogisticRegression

lr_clf = LogisticRegression()
lr_clf.fit(X_train, y_train)
predictions = lr_clf.predict(test_data)
Along with the predictions, I also want a column in the predictions dataframe with a unique identifier from X_train (right now predictions is a NumPy array). Let's say the unique ID is the ID column in X_train.
Expected output
predictions    ID
11             1000
123            1001
and so on
You can include the unique ID from X_train along with the predictions as below.
# Modelling
import pandas as pd
from sklearn.linear_model import LogisticRegression

lr_clf = LogisticRegression()
lr_clf.fit(X_train, y_train)
predictions = lr_clf.predict(test_data)

# Add the ID alongside the predictions and save the pandas dataframe
# .values avoids pandas trying to align on the original X_train index
predictions_df = pd.DataFrame(data={"ID": X_train["ID"].values, "Predictions": predictions})
predictions_df.to_csv("predictions_df.csv", index=False, quoting=3, sep=';')
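Note that the predictions here come from test_data, so each prediction lines up with a row of test_data, not X_train. If test_data also carries the ID column, pairing the predictions with its IDs is usually what you want - a minimal sketch, assuming test_data is a pandas DataFrame that still contains an "ID" column:
# Pair each prediction with the ID of the row it was actually made for
predictions_df = pd.DataFrame({
    "ID": test_data["ID"].values,   # IDs of the predicted rows
    "Predictions": predictions,
})
predictions_df.to_csv("predictions_df.csv", index=False, quoting=3, sep=';')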
I'm fairly new to Python and ML. I have a simple table that contains a date column and a float value. I want to predict future sales for a given period, let's say 2022-01. I managed to obtain a prediction based on my data, but the number of prediction values equals the number of training values I supplied. Also, isn't the mean squared error value too high? So far, I have the following:
import pandas as pd
import numpy as np
import datetime
df=pd.read_csv(r"Sale.csv")
# Break the date column into multiple columns
df["Data"]=pd.to_datetime(df["Data"])
df["Data"]=df["Data"].dt.strftime("%d.%m.%Y")
df["Year"]=pd.DatetimeIndex(df["Data"]).year
df["Month"]=pd.DatetimeIndex(df["Data"]).month
df["Day"]=pd.DatetimeIndex(df["Data"]).day
df["Weekday"]=pd.DatetimeIndex(df["Data"]).weekday
df["Dayofyear"]=pd.DatetimeIndex(df["Data"]).dayofyear
df=df.drop(["Data"],axis=1) #drop initial column
## Dummy Encoding
df = pd.get_dummies(df, columns=['Year'], drop_first=False, prefix='Year')
df = pd.get_dummies(df, columns=['Month'], drop_first=True, prefix='Month')
df = pd.get_dummies(df, columns=['Weekday'], drop_first=True, prefix='Weekday')
##split Train and test data
train=df.sample(frac=0.8,random_state=200)
test=df.drop(train.index)
target_column_train=['Sales']
predictors_train= list(set(list(train.columns))-set(target_column_train))
X_train=train[predictors_train].values
y_train=train[target_column_train].values
##Loading ML model
from sklearn import model_selection
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
model_rf = RandomForestRegressor(n_estimators=5000, oob_score=True, random_state=100)
model_rf.fit(X_train, y_train.ravel()) #.ravel will convert the array shape to (n, )
pred_train_rf= model_rf.predict(X_train)
print("RMSE:")
print(np.sqrt(mean_squared_error(y_train,pred_train_rf)))
# 7956042.545725489
print ("\n r2_score(Coefficient of determination:) is : ")
print(r2_score(y_train, pred_train_rf))
# 0.9284689685103222
(Screenshots attached to the original question: Data, DataVisualisation.)
When you run model_rf.predict you are running it on your X_train rather than your test set - that's why the number of prediction values equals the number of training rows. You want to fit your model on your train data and predict on your test data.
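A minimal sketch of that, reusing the variables from the code above:
# Build the test matrices the same way as the training ones
X_test = test[predictors_train].values
y_test = test[target_column_train].values

# Predict on unseen data and evaluate there
pred_test_rf = model_rf.predict(X_test)
print("Test RMSE:", np.sqrt(mean_squared_error(y_test, pred_test_rf)))
To forecast a period such as 2022-01, you would then construct rows with the same engineered date features (the Year/Month/Weekday dummies, Day, Dayofyear) for each date in that period and pass them to model_rf.predict.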
I'm having a problem with sklearn.
When I train it with .fit(), it raises ValueError: could not convert string to float: 'Casado'.
This is my code:
"""
from sklearn.naive_bayes import GaussianNB
import pandas as pd
# 1. Create Naive Bayes classifier:
gaunb = GaussianNB()
# 2. Create dataset:
dataset = pd.read_csv("archivos_de_datos/Datos_Historicos_Clientes.csv")
X_train = dataset.drop(["Compra"], axis=1) #Here I removed the last column "Compra"
Y_train = dataset["Compra"] #This one only consists of that column "Compra"
print("X_train: ","\n", X_train)
print("Y_train: ","\n", Y_train)
dataset2 = pd.read_csv("archivos_de_datos/Nuevos_Clientes.csv")
X_test = dataset2.drop("Compra", axis=1)
print("X_test: ","\n", X_test)
# 3. Train classifier with dataset:
gaunb = gaunb.fit(X_train, Y_train) #Here shows "ValueError: could not convert string to float: 'Casado'"
# 4. Predict using classifier:
prediction = gaunb.predict(X_test)
print("PREDICTION: ",prediction)
"""
And the dataset I'm using is a .csv file that looks like this (but with more rows):
IdCliente,EstadoCivil,Profesion,Universitario,TieneVehiculo,Compra
1,Casado,Empresario,Si,No,No
2,Casado,Empresario,Si,Si,No
3,Soltero,Empresario,Si,No,Si
I'm trying to train it to determine (with a test dataset) whether the last column would be a yes or no (Si or No).
I appreciate your help; I'm obviously new at this and I don't understand what I am doing wrong here.
I would use OneHotEncoder to, like Lavin mentioned, make the yes or no a numerical value. A model such as this can't process categorical data.
OneHotEncoder is used to handle binary data such as yes/no or male/female, while a label encoder is used for categorical data with more than 2 values, e.g., country names.
It will look something like this; however, you'll have to do this with all categorical data, not just your y column, and use a label encoder for columns that are not binary (more than 2 values - for example, perhaps EstadoCivil).
Also, I would suggest removing any features that don't contribute to your model; for instance, a client ID sounds like it may not add any value in determining your dependent variable. This is context specific, but something to keep in mind.
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [Insert column number for your df])], remainder='passthrough')
X = np.array(ct.fit_transform(X))
For the docs:
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
More info:
https://contactsunny.medium.com/label-encoder-vs-one-hot-encoder-in-machine-learning-3fc273365621#:~:text=What%20one%20hot%20encoding%20does,which%20column%20has%20what%20value.&text=So%2C%20that's%20the%20difference%20between%20Label%20Encoding%20and%20One%20Hot%20Encoding.
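Applied to the sample CSV above, a sketch could look like this (the column names are taken from the header row shown; treating all four non-ID feature columns as categorical is an assumption):
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

dataset = pd.read_csv("archivos_de_datos/Datos_Historicos_Clientes.csv")
X = dataset.drop(["IdCliente", "Compra"], axis=1)  # drop the ID and the target
y = dataset["Compra"]

# One-hot encode the categorical feature columns;
# sparse_threshold=0 keeps the output dense
ct = ColumnTransformer(
    transformers=[('encoder', OneHotEncoder(),
                   ["EstadoCivil", "Profesion", "Universitario", "TieneVehiculo"])],
    remainder='passthrough', sparse_threshold=0)
X = np.array(ct.fit_transform(X))

# Encode the binary target Si/No as integers
y = LabelEncoder().fit_transform(y)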
I am trying to train a classifier to take in a news headline as input and output tags that fit that headline. My data contains a bunch of news headlines as the input variables and meta-tags for those headlines as the output variables.
I one-hot encoded both the headlines and their corresponding meta-tags into two separate CSVs. I then combined them into one large data frame, with the X_train values being a 5573x958 NumPy array for the headline words and the y_train values being a 5573x843 NumPy array.
Here is an image of a pandas dataframe containing my data in one-hot encoded form.
The goal of my classifier is for me to feed in a headline and have the most related tags to that headline as the output. The problem I have is the following.
X_train = train_set.iloc[:, :958].values
X_train.shape
(out) (5573, 958)
y_train = train_set.iloc[:, 958:].values
y_train.shape
(out) (5573, 843)
from sklearn.naive_bayes import MultinomialNB
nb_clf = MultinomialNB().fit(X_train, y_train)
When I train it using a naive-bayes classifier, I get the following error message:
bad input shape (5573, 843)
From what I researched, the only way I can have multi-label target values is by one-hot encoding them: when I tried LabelEncoder() or MultiLabelBinarizer(), I had to specify the name of each column to be binarized, and with over 800 columns (words) to specify, I could not figure out how to do it. So I just one-hot encoded them, which I believe gives the same result; the classifier just doesn't like it as input. Any suggestions on how I can fix this?
You can use scikit-learn's multi-target classification. Here is an example:
from sklearn.multioutput import MultiOutputClassifier
from sklearn.naive_bayes import MultinomialNB
nb_clf = MultiOutputClassifier(MultinomialNB()).fit(X_train, y_train)
You can see the documentation at this link: sklearn.multioutput.MultiOutputClassifier
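Once fitted, predict returns one 0/1 column per tag, which you can map back to the tag names - a sketch, assuming tag_names holds the 843 tag column names from your data frame and X_test is a held-out headline matrix:
pred = nb_clf.predict(X_test)  # shape (n_samples, 843)

# Recover the tag names predicted for the first headline
predicted_tags = [tag for tag, flag in zip(tag_names, pred[0]) if flag == 1]
print(predicted_tags)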
I have three classes (setosa, versicolor, virginica) and also 4 other columns (sepal_length, sepal_width, petal_length, petal_width) with around 150 rows, each filled with its own information (so nothing is empty there). I need to predict the class based on the other columns.
This is what I have tried:
import numpy as np
import pandas as pd
df = pd.read_csv("data.csv")
X=df[["sepal_length","sepal_width","petal_length","petal_width"]]
y=df["class"]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.1)
from sklearn.linear_model import LinearRegression
clf=LinearRegression()
clf.fit(y_train, X_train)
clf.predict(y_test)
This code raises the following error:
ValueError: could not convert string to float: 'virginica'
I need to do this with train and test.
You need to encode your data; in other words, transform each category into a number (int or float).
Map the categories like this:
mapping = {'setosa': 0, 'versicolor': 1, 'virginica': 2}
y = y.map(mapping)
After you train your model, you will get 0,1 or 2 as a result. Convert it back and you'll have your predictions.
And by the way, if you are predicting a class, you must change your model: LinearRegression() is a numerical predictor; it can only predict numerical values. Note also that fit expects the features first and the target second, i.e. clf.fit(X_train, y_train), not clf.fit(y_train, X_train).
Try SVC, LogisticRegression, or any other classification model instead.
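A minimal sketch of the full flow with LogisticRegression, reusing the train/test split from the question's code:
from sklearn.linear_model import LogisticRegression

mapping = {'setosa': 0, 'versicolor': 1, 'virginica': 2}
y_train_enc = y_train.map(mapping)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train_enc)        # features first, target second
pred = clf.predict(X_test)

# Convert the numeric predictions back to class names
inverse = {v: k for k, v in mapping.items()}
pred_labels = [inverse[p] for p in pred]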
I have a dataset that has a unique identifier and other features. It looks like this
ID       LenA  TypeA  LenB  TypeB  Diff  Score  Response
123-456    51      M   101      L    50    0.2         0
234-567    46      S    49      S     3    0.9         1
345-678    87      M    70      M    17    0.7         0
I split it up into training and test data. I am trying to classify the test data into two classes with a classifier trained on the training data. I want the identifier in the training and testing datasets so I can map the predictions back to the IDs. Is there a way to assign the identifier column as an ID or non-predictor, like we can do in Azure ML Studio or SAS?
I am using the DecisionTreeClassifier from Scikit-Learn. This is the code I have for the classifier.
from sklearn import tree
clf = tree.DecisionTreeClassifier()
clf = clf.fit(traindata, trainlabels)
If I just include the ID in the traindata, the code throws an error:
ValueError: invalid literal for float(): 123-456
Not knowing how you made your split, I would suggest just making sure the ID column is not included in your training data. Something like this, perhaps:
X_train, X_test, y_train, y_test = train_test_split(df.loc[:, ~df.columns.isin(['ID', 'Response'])].values, df.Response)
That will split only the values from the DataFrame not in ID or Response for the X values, and split Response for the y values.
But you will still not be able to use the DecisionTreeClassifier with this data, as it contains strings. You will need to convert any column with categorical data, i.e. TypeA and TypeB, to a numerical representation. The best way to do this in my opinion for sklearn is with the LabelEncoder. Using this will convert the categorical string labels ['M', 'S'] into [0, 1], which the DecisionTreeClassifier can work with. If you need an example, take a look at Passing categorical data to sklearn decision tree.
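For example, a sketch that encodes the two categorical columns from the table above (a fresh encoder per column, so the label-to-integer mappings stay independent):
from sklearn.preprocessing import LabelEncoder

for col in ['TypeA', 'TypeB']:
    # e.g. ['M', 'S'] -> [0, 1]
    df[col] = LabelEncoder().fit_transform(df[col])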
Update
Per your comment I now understand that you need to map back to the ID. In this case you can leverage pandas to your advantage. Set ID as the index of your data and then do the split, that way you will retain the ID value for all of your train and test data. Let's assume your data are already in a pandas dataframe.
df = df.set_index('ID')
X_train, X_test, y_train, y_test = train_test_split(df.loc[:, ~df.columns.isin(['Response'])], df.Response)
print(X_train)
LenA TypeA LenB TypeB Diff Score
ID
345-678 87 M 70 M 17 0.7
234-567 46 S 49 S 3 0.9
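Because the IDs now live in the index, they survive the split, and the predictions can be mapped straight back - a sketch, assuming the categorical columns have already been encoded as described above:
clf = tree.DecisionTreeClassifier()
clf.fit(X_train, y_train)

# X_test.index still carries the IDs, so pair them with the predictions
results = pd.DataFrame({'prediction': clf.predict(X_test)}, index=X_test.index)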
A pandas DataFrame keeps its row order when you apply transformations (except joins/merges, which create or drop rows).
So, here it is step by step (a compact version follows the list):
1. Create a df_test dataframe that has the 'id' column.
2. Create df_test2, which doesn't have the 'id' column: df_test2 = df_test.drop(["id"], axis=1)
3. Feed df_test2 into the model for prediction: pred = model.predict(df_test2)
4. Create df_pred_final from the 'id' column of df_test: df_pred_final = df_test[["id"]].copy()
5. Add a 'target' column to df_pred_final; the id-target pairs will map correctly: df_pred_final["target"] = pred
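Putting those steps together (a sketch; model is assumed to be an already-fitted estimator):
df_test2 = df_test.drop(["id"], axis=1)   # features only, no id
pred = model.predict(df_test2)            # row order is preserved

df_pred_final = df_test[["id"]].copy()    # keep the id column
df_pred_final["target"] = pred            # same order, so each id pairs with its prediction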
Please take a look at my kaggle notebook. You might get the idea.
https://www.kaggle.com/tthien/20210412-complex-drop-c10-c2