I have some code that helps me predict some missing values. This is the code:
from datawig import SimpleImputer
from datawig.utils import random_split
from sklearn.metrics import f1_score, classification_report
df_train, df_test = random_split(df, split_ratios=[0.8, 0.2])
# Initialize a SimpleImputer model
imputer = SimpleImputer(
    input_columns=['SITUACION_DNI_A'],  # columns containing information about the column we want to impute
    output_column='EXTRANJERO_A',       # the column we'd like to impute values for
    output_path='imputer_model'         # stores model data and metrics
)
# Fit an imputer model on the train data
imputer.fit(train_df=df_train, num_epochs=10)
# Impute missing values and return original dataframe with predictions
predictions = imputer.predict(df_test)
After that I get a new dataframe with fewer rows than the original. How can I insert the values I get from the prediction into my original dataframe? Or is there a way to run the code on my whole dataframe and not just the test split?
If both dataframes have a unique column, or something that can act like an ID, then this method will work:
df_test = df_test.set_index('unique_col')
df_test = df_test.fillna(predictions.set_index('unique_col'))
If the above method does not work, then drop the rows with missing values and append the imputer's predictions to the dataframe (a sketch follows after the links). Look at the following links for help:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.append.html
Delete rows if there are null values in a specific column in Pandas dataframe
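If you go that route, here is a rough sketch. It is untested against your data: it assumes a 'unique_col' ID column and that datawig named the prediction column 'EXTRANJERO_A_imputed' (check predictions.columns for the exact name on your side). Note that DataFrame.append is deprecated; pd.concat is the current idiom:
import pandas as pd

# Rows whose target is already filled stay as they are
missing = df['EXTRANJERO_A'].isna()
complete_rows = df[~missing]

# For the incomplete rows, pull the imputed value from the predictions frame
imputed_rows = (
    df[missing]
    .drop(columns=['EXTRANJERO_A'])
    .merge(
        predictions[['unique_col', 'EXTRANJERO_A_imputed']]
        .rename(columns={'EXTRANJERO_A_imputed': 'EXTRANJERO_A'}),
        on='unique_col',
        how='left',
    )
)

# pd.concat replaces the deprecated DataFrame.append
df_full = pd.concat([complete_rows, imputed_rows], ignore_index=True)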
I'm fairly new to Python and ML. I have a simple table that contains a date column and a float value. I want to predict future sales for a given period, say 2022-01. I managed to obtain a prediction based on my data, but the number of prediction values is equal to the number of training values I supplied. Also, isn't the mean squared error value too high? So far, I have the following:
import pandas as pd
import numpy as np
import datetime
df=pd.read_csv(r"Sale.csv")
# Break the date column into multiple columns
df["Data"]=pd.to_datetime(df["Data"])
df["Data"]=df["Data"].dt.strftime("%d.%m.%Y")
df["Year"]=pd.DatetimeIndex(df["Data"]).year
df["Month"]=pd.DatetimeIndex(df["Data"]).month
df["Day"]=pd.DatetimeIndex(df["Data"]).day
df["Weekday"]=pd.DatetimeIndex(df["Data"]).weekday
df["Dayofyear"]=pd.DatetimeIndex(df["Data"]).dayofyear
df=df.drop(["Data"],axis=1) #drop initial column
## Dummy Encoding
df = pd.get_dummies(df, columns=['Year'], drop_first=False, prefix='Year')
df = pd.get_dummies(df, columns=['Month'], drop_first=True, prefix='Month')
df = pd.get_dummies(df, columns=['Weekday'], drop_first=True, prefix='Weekday')
##split Train and test data
train=df.sample(frac=0.8,random_state=200)
test=df.drop(train.index)
target_column_train=['Sales']
predictors_train= list(set(list(train.columns))-set(target_column_train))
X_train=train[predictors_train].values
y_train=train[target_column_train].values
##Loading ML model
from sklearn import model_selection
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
model_rf = RandomForestRegressor(n_estimators=5000, oob_score=True, random_state=100)
model_rf.fit(X_train, y_train.ravel()) #.ravel will convert the array shape to (n, )
pred_train_rf= model_rf.predict(X_train)
print("RMSE:")
print(np.sqrt(mean_squared_error(y_train,pred_train_rf)))
# 7956042.545725489
print("\nr2_score (coefficient of determination) is:")
print(r2_score(y_train, pred_train_rf))
# 0.9284689685103222
When you run model.predict, you are running it on your X_train rather than on your test set; that's why the number of prediction values equals the number of training values. You want to fit your model on your train data and predict on your test data.
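A minimal sketch using the variable names from the question above (the test frame comes from the sample/drop split):
# Build the test matrices the same way as the training ones
X_test = test[predictors_train].values
y_test = test[target_column_train].values.ravel()

# Predict on the held-out rows, not on X_train
pred_test_rf = model_rf.predict(X_test)

print("Test RMSE:", np.sqrt(mean_squared_error(y_test, pred_test_rf)))
print("Test r2_score:", r2_score(y_test, pred_test_rf))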
I have a problem while reading the columns of my .csv file. I have this code:
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
# Importing the dataset
dataset = pd.read_csv('D:/CTU/ateroskleroza/development/results_output6.csv')
print(dataset.head())
X = dataset.iloc[:, 2:16].values
y = dataset.iloc[:, 0].values
# Splitting the dataset into the Training set and Test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)
classifier = make_pipeline(StandardScaler(), SVC(gamma='auto'))
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
# Making the Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)
# Generating accuracy, precision, recall and f1-score
target_names = ['Progressive','Stable']
print(classification_report(y_test, y_pred, target_names=target_names))
And the .csv looks like this (it was shown as a screenshot): depending on the name of the pictures, some columns are filled and others contain NaN. The problem is that when I try to execute this code, I get this error:
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
So how can I ignore the NaN values and only use the numbers? (I don't want to remove the empty columns, just ignore the NaN values during execution.)
I am writing this answer based on personal experience. If you want a more detailed answer, consider updating your post with a dataset we can use, stating what the model is supposed to predict, and describing the features.
@simpleApp suggested replacing null values with zeros before scaling the data and fitting the model. In the comments, you seem concerned about the effect of imputing null values on the final model.
When dealing with missing data, you have to weigh the pros and cons of imputing values. If you decide to ignore the observations with null values (either by dropping columns or whole observations), you could be missing out on some really important information and you won't be able to make predictions on new observations unless their data is completely full. Likewise, if you carelessly impute null values with some random value, you could introduce a bias to the model.
If you impute values correctly, your model will be able to handle missing data without compromising much of its accuracy. Sadly though, imputing values is more of an art than a hard science.
I have no idea what your data means, but think of age as an independent variable to predict risk of heart disease. Ask yourself: if a value is missing, am I better off ignoring the observation, or can I fill the void with a value that, on average, should not be too far away from the real unobserved age of the patient?
If you decide to fill the missing information with some value, I would suggest four really simple methods:
# Fill with the minimum value of each column
df = df.fillna(df.min())
# Fill with the median value of each column
df = df.fillna(df.median())
# Fill with the mean value of each column
df = df.fillna(df.mean())
# Fill with the maximum value of each column
df = df.fillna(df.max())
Your next step should be to score the resulting models and choose the one that generalizes best on unseen data.
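As a rough sketch of that comparison (the 'target' column name and the RandomForestClassifier are placeholders, not taken from your post), you could cross-validate each filled dataset:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Compare the four fill strategies by cross-validated score;
# 'target' is a placeholder for your label column.
for name, filled in [('min', df.fillna(df.min())),
                     ('median', df.fillna(df.median())),
                     ('mean', df.fillna(df.mean())),
                     ('max', df.fillna(df.max()))]:
    X = filled.drop(columns=['target'])
    y = filled['target']
    scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5)
    print(name, scores.mean())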
Among other common imputation techniques, you can fill null values with zero (df.fillna(0)), with the most frequent value (check SimpleImputer) or with more complex imputing techniques, such as nearest neighbors.
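A minimal sketch of the SimpleImputer route, assuming a feature matrix X:
from sklearn.impute import SimpleImputer

# Replace each column's NaNs with that column's most frequent value
imputer = SimpleImputer(strategy='most_frequent')
X_imputed = imputer.fit_transform(X)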
In the end, you will find out if imputing nulls was the right thing to do when you test your model's performance on unseen data.
As a general rule of thumb, you should consider dropping all columns that have more than 20% of their values missing.
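In pandas, that rule of thumb could look like this:
# Drop every column where more than 20% of the values are missing
missing_share = df.isna().mean()
df = df.drop(columns=missing_share[missing_share > 0.20].index)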
I am doing modelling, let's say logistic regression, and need to save the results in a dataframe (prediction results and a unique ID).
Code for predictions
from sklearn.linear_model import LogisticRegression
lr_clf = LogisticRegression()
lr_clf.fit(X_train, y_train)
predictions = lr_clf.predict(test_data)
I want that, along with the predictions, the predictions dataframe should also have a column with a unique identifier from X_train (right now predictions is a NumPy array). Let's say the unique ID is the ID column in X_train.
Expected output
predictions ID
11 1000
123 1001
and so on
You can include the unique ID along with the predictions as below. Note that the IDs must come from the rows you actually predicted on (test_data here); taking them from X_train would give a column of the wrong length.
# Modelling
import pandas as pd
from sklearn.linear_model import LogisticRegression

lr_clf = LogisticRegression()
lr_clf.fit(X_train, y_train)
predictions = lr_clf.predict(test_data)

# Add the IDs along with the predictions and save the dataframe.
# The IDs are taken from test_data because that is the frame the
# predictions were made on.
predictions_df = pd.DataFrame({"ID": test_data["ID"].values, "Predictions": predictions})
predictions_df.to_csv("predictions_df.csv", index=False, quoting=3, sep=';')
In the train.csv data for the Titanic machine learning project, some passengers have missing age data, which pandas fills in as NaN, and sklearn algorithms will not accept it. I tried dataset.fillna('') but that turns it into an empty string, not a float. Please help.
https://www.kaggle.com/c/titanic/data
import pandas as pd
from sklearn.model_selection import train_test_split
dataset = pd.read_csv('train.csv')
#dataset = dataset.fillna()
def preprocess(df):
    from sklearn.preprocessing import LabelEncoder
    processed_df = df.copy()
    le = LabelEncoder()
    done = le.fit_transform(processed_df)
    return done
survival = preprocess(dataset.Survived)
data = dataset.drop('Survived',axis= 1)
data = data.drop('PassengerId',axis=1)
data = data.drop('Embarked',axis = 1)
data = data.drop('Cabin',axis = 1)
data = data.drop('Fare',axis = 1)
data = data.drop('Ticket',axis = 1)
data = data.drop('Name',axis=1)
x_train, x_test, y_train, y_test = train_test_split(data, survival, test_size=0.25, random_state=0)
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn import svm
from sklearn.metrics import accuracy_score
pipeline = make_pipeline(StandardScaler(),
                         svm.SVC(kernel='rbf', C=0.1))
pipeline.fit(x_train,y_train)
print(accuracy_score(pipeline.predict(x_test),y_test))
fillna replaces the NaN values with whatever you write, so if you write '', they become empty strings. Just write:
dataset = dataset.fillna(0)
If you need to distinguish between 0 and NaN, you can replace them with -1 instead; that's what we do.
There are many methods you can use to deal with missing values in a machine learning project:
drop all the columns with missing values
drop the rows containing missing values
set the values to some value (zero, the mean, the median, etc.)
For the third option:
Scikit-Learn provides a handy class to take care of missing values: SimpleImputer (the old Imputer class has been removed from recent versions). Here is how to use it. First, you need to create a SimpleImputer instance, specifying that you want to replace each attribute's missing values with the median of that attribute:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy="median")  # or "mean", as you prefer
x_train = imputer.fit_transform(x_train)
x_test = imputer.transform(x_test)  # fit on the training set only, then transform the test set
The result is a plain NumPy array containing the transformed features. If you want to put it back into a Pandas DataFrame, it's simple (see the sketch below).
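A minimal sketch of that step; x_train_raw is a hypothetical name for the DataFrame as it was before imputation:
import pandas as pd

# Wrap the imputed NumPy array back into a DataFrame, reusing the
# original column names and index (x_train_raw is a hypothetical name
# for the DataFrame before imputation).
x_train = pd.DataFrame(imputer.fit_transform(x_train_raw),
                       columns=x_train_raw.columns,
                       index=x_train_raw.index)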
NB: You could also add the imputer in the pipeline, just before the scaler:
pipeline = make_pipeline(SimpleImputer(strategy="median"),
                         StandardScaler(),
                         svm.SVC(kernel='rbf', C=0.1))
I tried this but couldn't get it to work for my data:
Use Scikit Learn to do linear regression on a time series pandas data frame
My data consists of 2 DataFrames: DataFrame_1.shape = (40, 5000) and DataFrame_2.shape = (40, 74). I'm trying to do some type of linear regression, but DataFrame_2 contains NaN missing values. When I call DataFrame_2.dropna(how="any"), the shape drops to (2, 74).
Is there any linear regression algorithm in sklearn that can handle NaN values?
I'm modeling it after load_boston from sklearn.datasets, where X, y = boston.data, boston.target have shapes (506, 13) and (506,).
Here's my simplified code:
from sklearn.linear_model import LinearRegression

X = DataFrame_1
for col in DataFrame_2.columns:
    y = DataFrame_2[col]
    model = LinearRegression()
    model.fit(X, y)
    # ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
I used the above format to get the shapes of the matrices to match up.
If posting the DataFrame_2 would help, please comment below and I'll add it.
You can fill in the null values in y with imputation. In scikit-learn this is done with the following code snippet (Imputer has since been replaced by SimpleImputer, and it expects a 2-D array, hence the reshape):
from sklearn.impute import SimpleImputer
imputer = SimpleImputer()
y_imputed = imputer.fit_transform(y.values.reshape(-1, 1)).ravel()
Otherwise, you might want to build your model using a subset of the 74 columns as predictors; perhaps some of your columns contain fewer null values (see the sketch below).
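A quick sketch of that column selection; the 10% threshold is a placeholder to tune for your data:
# Keep only the columns of DataFrame_2 with at most 10% missing values
null_share = DataFrame_2.isna().mean()
usable_cols = null_share[null_share <= 0.10].index
DataFrame_2_subset = DataFrame_2[usable_cols]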
If your variable is a DataFrame, you could use fillna. Here I replaced the missing data with the mean of that column.
df.fillna(df.mean(), inplace=True)