How to fill missing values using a pre-trained model? - python

I have a time-series-indexed dataframe with a few variables and a humidity reading. I have already trained an ML model to predict Humidity from X, Y and Z. Now, when I load the saved model using pickle, I would like to fill the missing Humidity values using X, Y and Z. However, it should take into account that X, Y and Z themselves must not be missing.
Time            X    Y   Z    Humidity
1/2/2017 13:00  31   22  21   48
1/2/2017 14:00  NaN  12  NaN  NaN
1/2/2017 15:00  25   55  33   NaN
In this example, the last row's Humidity should be filled in by the model, whereas the 2nd row should not be predicted by the model since X and Z are also missing.
I have tried this so far:
with open('model_pickle','rb') as f:
    mp = pickle.load(f)

for i, value in enumerate(df['Humidity'].values):
    if np.isnan(value):
        df['Humidity'][i] = mp.predict(df['X'][i], df['Y'][i], df['Z'][i])
This gave me the error 'predict() takes from 2 to 5 positional arguments but 6 were given', and it also does not check whether the X, Y and Z column values are present. Below is the code I used to train the model and save it to a file:
df = df.dropna()
dfTest = df.loc['2017-01-01':'2019-02-28']
dfTrain = df.loc['2019-03-01':'2019-03-18']
features = [ 'X', 'Y', 'Z']
train_X = dfTrain[features]
train_y = dfTrain.Humidity
test_X = dfTest[features]
test_y = dfTest.Humidity
model = xgb.XGBRegressor(max_depth=10,learning_rate=0.07)
model.fit(train_X,train_y)
predXGB = model.predict(test_X)
mae = mean_absolute_error(predXGB,test_y)
import pickle
with open('model_pickle','wb') as f:
    pickle.dump(model, f)
I had no errors during training and saving the model.

For prediction, since you want to make sure you have all the X, Y, Z values, you can do,
df = df.dropna(subset = ["X", "Y", "Z"])
And now you can predict the values for the remaining valid examples as,
# where features = ["X", "Y", "Z"]
df['Humidity'] = mp.predict(df[features])
mp.predict will return predictions for all the rows, so there is no need to predict row by row.
Edit:
For inference, say you have a dataframe df; you can do:
# Get rows with missing Humidity where it could be predicted.
df_inference = df[df.Humidity.isnull()]
# Remaining rows.
df = df[df.Humidity.notnull()]
# df_inference might still have rows with missing features.
# Since you cannot infer with missing features, remove them too
# and add them back to the remaining rows.
df = pd.concat([df, df_inference[df_inference[features].isnull().any(axis=1)]])
# ...and remove them from df_inference.
df_inference = df_inference[~df_inference[features].isnull().any(axis=1)]
# Now you can infer on these rows.
df_inference['Humidity'] = mp.predict(df_inference[features])
# Merge this back into the remaining rows to recover the original
# number of rows, and sort by index.
df = pd.concat([df, df_inference])
df = df.sort_index()
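An alternative that avoids the split-and-append dance is to build a boolean mask and assign the predictions in place (a minimal sketch, assuming the same df, features list and loaded model mp as above):

# Rows where Humidity is missing but X, Y and Z are all present.
mask = df['Humidity'].isnull() & df[features].notnull().all(axis=1)
# Predict only those rows and write the result straight back.
df.loc[mask, 'Humidity'] = mp.predict(df.loc[mask, features])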

Can you report the error?
Anyway, if you have missing values, you have several options for dealing with them. You can either discard the data point entirely or try to infer the missing parts with a method of your choice: mean, interpolation, etc.
Pandas documentation has a nice guide on how to deal with them:
https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html
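For example (a small illustrative sketch, not taken from the question's data):

import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])
s.fillna(s.mean())   # replace NaNs with the mean of the series
s.interpolate()      # linear interpolation between neighbouring values
s.dropna()           # or drop the missing points entirely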

Try
df['Humidity'][i] = mp.predict(df[['X', 'Y', 'Z']].iloc[[i]])[0]
This way the data is passed as a single 2-D argument (a one-row dataframe), which is what the function expects. The way you wrote it, you split your data into 3 separate arguments.

Related

Using ordinary least squares regression with multiple predictor variables on pandas

I have the following code which is trying to predict a y variable, in this case 'distance', based on multiple predictor variables, which are stored in newdf[cols].
However, when I run the code, I get the error: 'Pandas data cast to numpy dtype of object. Check input data with np.asarray(data)'.
Am I specifying the smf.ols() command in the wrong way?
I would be so grateful for a helping hand.
import statsmodels.api as sm
cols = newdf.drop(['distance', 'duration','short_id'],axis=1)
X = cols
Y = newdf['distance']
X = sm.add_constant(X)
resultmodel = sm.OLS(Y,X).fit()
print(resultmodel.summary())
For the formula API you have to pass a formula string as the first argument. If you just want to pass X and y and use all columns of X, you can use the non-formula API: basically, replace smf.ols with sm.OLS in your code.
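A minimal sketch of the non-formula API (assuming newdf holds only numeric predictors once the text columns are dropped; non-numeric columns are what usually triggers the 'cast to numpy dtype of object' message):

import pandas as pd
import statsmodels.api as sm

X = newdf.drop(['distance', 'duration', 'short_id'], axis=1)
X = X.apply(pd.to_numeric, errors='coerce')  # force every predictor to be numeric
X = sm.add_constant(X)
Y = newdf['distance']

resultmodel = sm.OLS(Y, X).fit()
print(resultmodel.summary())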

Python: prepare training data set with an evenly distributed response variable

I am working on a small machine learning project.
The dataset I am using has 56 input parameters and one categorical response variable (0/1). My problem is that the response classes are not evenly distributed. I want to prepare the training dataset so that the responses are evenly distributed. How can this be done?
The training dataset should have the same number of 1s and 0s in the response.
Thanks for your help; as you can imagine, I am really a beginner...
I am the same person as the one who asked the question; sorry for that.
First, I load the data from a CSV file (not shown in the code here); this is stored as data. Next, I create a new column named "response_class" based on the value in the "response" column: if it is below 0.045, response_class = 1, otherwise 0. Second, I randomly sample 10,000 rows from the data (due to computation limits), and here I want to make sure that I get the same number of 1s and 0s in response_class. At the end I split the data to make it ready for a correlation matrix and for the test and train data.
Here is my code:
data = data[data.response != 0]
pd.DataFrame(data)
data['response_class'] = np.where(data['response'] <= 0.045, 1, 0)
#1=below .045 / 0=above 0.045
#reduce amount of data by picking random samples
data= data.sample(n=10000)
#split data
data.drop(['response'], axis=1, inplace=True)
y = data['response_class']
X = data.drop('response_class', axis=1)
X_names = X.columns
data.head()
Found a solution:
from sklearn.utils import resample

# Separate based on the response variable response_class.
df_zero = pd.DataFrame(data[data.response_class == 0])
df_one = pd.DataFrame(data[data.response_class == 1])

# Upsample the minority class to match the size of the other class.
df_zero_min = resample(df_zero,
                       replace=True,
                       n_samples=len(df_one),
                       random_state=123)

df_upsampled = pd.concat([df_one, df_zero_min])
df_upsampled.response_class.value_counts()
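To finish the workflow described above, the balanced frame can then be split while preserving the 50/50 ratio (a sketch, assuming the df_upsampled from the answer):

from sklearn.model_selection import train_test_split

y = df_upsampled['response_class']
X = df_upsampled.drop('response_class', axis=1)

# stratify=y keeps the 1/0 ratio identical in the train and test parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=123)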

NearestNeighbors' kneighbors method returns different output for different sample sizes

I've built a NearestNeighbors model with scikit-learn. The clusters seem fine when I get them with the kneighbors method just after fitting the model.
model = NearestNeighbors(n_jobs=-1, n_neighbors=5).fit(np.array(df))
distance, indices = model.kneighbors(np.array(df)) ## one of the distances is always 0, as expected. And clusters are acceptable.
But when I save the model and then load it back to run on the training data, the outputs are not acceptable.
model = pickle.load(f)
distance, indices = model.kneighbors(np.array(df)) ## same dataset, average/bad results. None of distances are 0.
And, the biggest problem: indices and distances change depending on the size of df.
model = pickle.load(f)
df_1 = df[df["id"] == "1"] # Trying for just one user
distance, indices = model.kneighbors(np.array(df_1)) ## one row, same output for every user.
df_2 = df[df["id"] == "2"]
distance, indices = model.kneighbors(np.array(df_2)) ## same output
df = df[df["id"] == "2" | df["id"] == "1"]
distance, indices = model.kneighbors(np.array(df)) ## different output for both
The train/test dataset looks like this:
feature1  feature2  feature3
       0         1         1
       1         1         1
       0         0         0
Why do we train and save a model if it cannot be used afterwards with a different dataset? Is this expected behavior of the model, or am I missing something?
Well, it was a horrible mistake I made, and I want to share the problem and the solution. It is very simple, but it may be hard to see.
I read the docs a thousand times, and then noticed that they use np.array, not a DataFrame. I had used a DataFrame for prediction, and its columns ended up in a different order. So it was not working correctly.
If you have a problem like that, be careful about numpy indexes and column order!
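One way to guard against that is to pin the column order before converting to numpy (a sketch; feature_cols is a hypothetical list of the columns used when the model was fitted):

feature_cols = ['feature1', 'feature2', 'feature3']  # order used at fit time

# Selecting by this list guarantees the same column order for every query,
# so the saved model always sees features in the positions it was trained on.
X_query = df[feature_cols].to_numpy()
distance, indices = model.kneighbors(X_query)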

Linear Regression prediction by Date in python

I have converted the date into numerical values, but I am stuck on the next step: how to prepare the data for prediction. How do I use the date for prediction in Python code? How do I count the Eventhappen attribute? Please guide me and improve my code where it does not make sense. Below is my code.
# Here is the dataset
date          Eventhappen
2016-01-14    A
2016-01-15    C
2016-01-16    B
2016-01-17    A
2016-01-18    C
2016-02-18    B
#Converting Date into Numerical Value
df['Dispatch_Date_Time'] = pd.to_datetime(df['Dispatch_Date_Time'])
df.set_index('Dispatch_Date_Time', inplace=True)
df.sort_index(inplace=True)
df['month'] = df.index.month
df['year'] = df.index.year
df['day'] = df.index.day
df['eventhappen'] = 1
#Preparing the data
X = df[['year']]
y = df['eventhappen']
#Training the Algorithm
regressor = LinearRegression()
regressor.fit(X_train, y_train)
#Making the Predictions
y_pred = regressor.predict(X_test)
#Plotting the Least Square Line
sns.pairplot(df, x_vars=['year'], y_vars='eventhappen', size=7, aspect=0.7, kind='reg')
There is a lot of confusion in your code, for me at least: the column names used in the processing are not the same as the ones in the dataset. You have two scenarios to consider:
SN-A: If you want to predict which event happens on some future date, the target column 'Eventhappen' is categorical, so you have a multi-class classification task, not a regression. You should encode your target column, then split your dataset into train/test sets, and finally fit a classifier to predict the event on a future date.
SN-B: If you want to predict the number of events happening on some future date, then you are on the right track; you need a numerical column to predict, which is the count. That means this line of code should not be a constant:
df['eventhappen'] = 1
Once you have it, you should consider some time-series techniques (power transformations, lags, ...), then split into train/test datasets, and finally fit and evaluate your regressor.
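A rough sketch of scenario SN-B (assuming the df with the 'date' and 'Eventhappen' columns shown in the question): aggregate a daily count, derive simple date features, and fit the regressor on that count.

import pandas as pd
from sklearn.linear_model import LinearRegression

df['date'] = pd.to_datetime(df['date'])
# The target becomes the number of events per date, not a constant 1.
daily = df.groupby('date').size().reset_index(name='event_count')

daily['year'] = daily['date'].dt.year
daily['month'] = daily['date'].dt.month
daily['day'] = daily['date'].dt.day

X = daily[['year', 'month', 'day']]
y = daily['event_count']

regressor = LinearRegression()
regressor.fit(X, y)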
Use this function to extract the features needed from the date column, and then use them directly in your machine learning model. You can also encode the cyclic features, which gives the model the ability to extract cyclic patterns from the data.
def transform_col_date(data, date_col):
    '''
    data     : DataFrame (your dataset).
    date_col : String (name of the date column).
    '''
    data_ = data.copy()
    data_.reset_index(inplace=True)
    data_[date_col] = pd.to_datetime(data_[date_col], infer_datetime_format=True)
    data_['day'] = data_[date_col].dt.day
    data_['month'] = data_[date_col].dt.month
    data_['dayofweek'] = data_[date_col].dt.dayofweek
    data_['dayofyear'] = data_[date_col].dt.dayofyear
    data_['quarter'] = data_[date_col].dt.quarter
    # Series.dt.weekofyear is deprecated in newer pandas; isocalendar().week replaces it.
    data_['weekofyear'] = data_[date_col].dt.isocalendar().week
    data_['year'] = data_[date_col].dt.year
    return data_
#in your case
data = transform_col_date(df, 'date')
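The cyclic encoding mentioned above can be added on top of the extracted features, for example with a sin/cos pair (a sketch; month runs 1..12, so this keeps December numerically close to January):

import numpy as np

data['month_sin'] = np.sin(2 * np.pi * data['month'] / 12)
data['month_cos'] = np.cos(2 * np.pi * data['month'] / 12)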

Very Large Values Predicted for Linear Regression

I'm trying to run a linear regression in Python to determine house prices given many features. Some of these are numeric and some are non-numeric. I'm attempting to one-hot encode the non-numeric columns, attach the new numeric columns to the old dataframe, and drop the non-numeric columns. This is done on both the training data and the test data.
I then took the intersection of the two sets of feature columns (since some encodings only occurred in the testing data). Afterwards, it goes into a linear regression. The code is the following:
non_numeric = list(set(list(train)) - set(list(train._get_numeric_data())))
train = pandas.concat([train, pandas.get_dummies(train[non_numeric])], axis=1)
train.drop(non_numeric, axis=1, inplace=True)
train = train._get_numeric_data()
train.fillna(0, inplace = True)
non_numeric = list(set(list(test)) - set(list(test._get_numeric_data())))
test = pandas.concat([test, pandas.get_dummies(test[non_numeric])], axis=1)
test.drop(non_numeric, axis=1, inplace=True)
test = test._get_numeric_data()
test.fillna(0, inplace = True)
feature_columns = list(set(train) & set(test))
#feature_columns.remove('SalePrice')
X = train[feature_columns]
y = train['SalePrice']
lm = LinearRegression(normalize = False)
lm.fit(X, y)
import numpy
predictions = numpy.absolute(lm.predict(test).round(decimals = 2))
The issue that I'm having is that I get these absurdly high Sale Prices as output, somewhere in the hundreds of millions of dollars. Before I tried one hot encoding I got reasonable numbers in the hundreds of thousands of dollars. I'm having trouble figuring out what changed.
Also, if there is a better way to do this I'd be eager to hear about it.
You seem to be running into collinearity introduced by the categorical variables in the feature columns, since the one-hot encoded columns of a variable always sum to 1.
If you have one categorical variable, you need to set fit_intercept=False in your linear regression (or drop one of the one-hot encoded feature columns).
If you have more than one categorical variable, you need to drop one feature column for each category to break the collinearity; a pandas option for this is sketched below.
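Dropping one level per categorical variable can also be done directly when encoding, e.g. with pandas' drop_first option (a sketch, using the same train and non_numeric as in the question):

import pandas as pd

# drop_first=True removes one dummy column per categorical variable,
# which breaks the exact collinearity with the intercept.
dummies = pd.get_dummies(train[non_numeric], drop_first=True)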
from sklearn.linear_model import LinearRegression
import numpy as np
import pandas as pd
In [72]:
df = pd.read_csv('/home/siva/anaconda3/data.csv')
df
Out[72]:
   C1  C2  C3     y
0   1   0   0  12.4
1   1   0   0  11.9
2   0   1   0   8.3
3   0   1   0   3.1
4   0   0   1   5.4
5   0   0   1   6.2
In [73]:
X = df.iloc[:,0:3]
y = df.iloc[:,-1]
In [74]:
reg = LinearRegression()
reg.fit(X,y)
Out[74]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
In [75]:
reg.coef_, reg.intercept_
Out[75]:
(array([ 4.26666667, -2.18333333, -2.08333333]), 7.8833333333333346)
We find that the coefficients for C1, C2 and C3 do not make sense given the X above.
In [76]:
reg1 = LinearRegression(fit_intercept=False)
reg1.fit(X,y)
Out[76]:
LinearRegression(copy_X=True, fit_intercept=False, n_jobs=1, normalize=False)
In [77]:
reg1.coef_
Out[77]:
array([ 12.15, 5.7 , 5.8 ])
We find that the coefficients make much more sense when fit_intercept is set to False.
A detailed explanation for a similar question at below.
https://stats.stackexchange.com/questions/224051/one-hot-vs-dummy-encoding-in-scikit-learn
I posted this at the stats site, and Ami Tavory pointed out that get_dummies should be run on the merged train and test dataframes to ensure that the same dummy variables were set up in both. This solved the issue.
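A sketch of that fix (assuming the original train and test frames, before any encoding):

import pandas as pd

# Encode train and test together so both get exactly the same dummy columns.
combined = pd.concat([train, test], keys=['train', 'test'])
combined = pd.get_dummies(combined)

train = combined.xs('train')
test = combined.xs('test')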
