I am trying to impute missing values, not with zeros or means, but with ML predicted results. I'm testing my idea on the standard 'Titanic' dataset, which has around 80% of the age records filled in, but around 20% are missing. How can I fill in missing values with the predicted results from a simple linear regression model? Here is the code that I am testing.
import pandas as pd
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
data = pd.read_csv('C:\\Users\\ryans\\seaborn-data\\titanic.csv')
data_with_null = data[['survived','pclass','sibsp','parch','fare','age']]
data_without_null = data_with_null.dropna()
train_data_x = data_without_null.iloc[:,:5]
train_data_y = data_without_null.iloc[:,5]
linreg.fit(train_data_x, train_data_y)  # fit on the rows where age is known
test_data = data_with_null.iloc[:,:5]
age = pd.DataFrame(linreg.predict(test_data))
# check for nulls (count of missing values per column)
data_with_null.isnull().sum()
Everything works up to this point, but when I try to 'fillna', I get errors.
The line of code directly above shows this error:
TypeError: "value" parameter must be a scalar, dict or Series, but you passed a "DataFrame"
Similarly, the line of code above shows this error:
TypeError: apply() missing 1 required positional argument: 'func'
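A minimal sketch of one way to do the fill, not a definitive recipe. The first error says fillna wants a scalar, dict or Series, so the idea here is to wrap the predictions in a Series that shares the frame's index and pass that to fillna on the age column. The small random frame and its column subset are made up for illustration; they only stand in for the Titanic data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in for the Titanic columns (random values, not the real data)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "pclass": rng.integers(1, 4, size=100),
    "fare": rng.uniform(5, 100, size=100),
    "age": rng.uniform(1, 80, size=100),
})
# Blank out about 20% of the ages to mimic the missing records
df.loc[df.sample(frac=0.2, random_state=0).index, "age"] = np.nan
original_age = df["age"].copy()

features = ["pclass", "fare"]
known = df["age"].notna()

# Fit only on rows where age is known
model = LinearRegression().fit(df.loc[known, features], df.loc[known, "age"])

# Predict for every row, keep the frame's index so fillna can align on it
predicted_age = pd.Series(model.predict(df[features]), index=df.index)

# fillna accepts a Series; only the missing ages are replaced
df["age"] = df["age"].fillna(predicted_age)
```

Because fillna aligns on the index, rows that already had an age keep their original value and only the gaps receive predictions.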
I am trying to do a simple linear regression. I've made a data frame in pandas and then ran the line below. It doesn't remove any rows, so I assumed there were no missing values.
df.dropna() # remove all rows with missing data
Then I ran the following code to create NumPy arrays for 2 columns that I wish to run my linear regression between.
Amy_Dif_Num = df.iloc[:, 10].values.reshape(-1, 1)
Pain_Int_Num = df.iloc[:, 3].values.reshape(-1, 1) # values converts it into a numpy array
Then when I run my regression, it errors that I have a missing value. What am I missing here? Please help.
X = Pain_Int_Num
Y = Amy_Dif_Num
linear_regressor = LinearRegression() # create object for the class
linear_regressor.fit(X, Y) # perform linear regression
Y_pred = linear_regressor.predict(X) # make predictions
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
When I run
np.isnan(X).any() #this comes out True
How do I remove this null value whilst also continuing to have the same number of columns in each NumPy array? Please help.
Also, why didn't my df.dropna() code work earlier?
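A minimal sketch that may help with both questions. df.dropna() returns a new DataFrame and leaves df itself untouched unless the result is assigned back, which would explain why the later arrays still contain NaN. Dropping rows on just the two columns of interest keeps X and Y the same length. The frame and the column names "pain" and "amy" below are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data with one NaN in each column
df = pd.DataFrame({
    "pain": [1.0, 2.0, np.nan, 4.0, 5.0],
    "amy": [2.1, 3.9, 6.0, 8.2, np.nan],
})

df.dropna()                      # result is discarded, df is unchanged
rows_before = len(df)            # still 5

# Keep only rows where BOTH columns are present, and assign the result
clean = df.dropna(subset=["pain", "amy"])

X = clean[["pain"]].to_numpy()   # shape (n, 1)
Y = clean[["amy"]].to_numpy()    # shape (n, 1), same n as X

linear_regressor = LinearRegression().fit(X, Y)
Y_pred = linear_regressor.predict(X)
```

Calling dropna once on the frame, before splitting into arrays, is what keeps the two arrays aligned row for row.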
I am fitting a simple ARIMAX(1,0,0) model with one dependent variable y and one independent variable x, using a time series of 49 observations.
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
df = pd.read_excel('/Users/gaetanlion/Google Drive/Python/Arima/df.xlsx', sheet_name = 'final')
from statsmodels.tsa.arima_model import ARIMA
endo = df['y']
exo = df['x']
''' Doing a ARIMA(1,0,0) '''
model = ARIMA(endo, exo, order = (1,0,0)).fit()
When I run this simple model, I get the following error:
TypeError: __new__() got multiple values for argument 'order'
Ok, I was able to resolve this coding issue. But, I am not so sure it is the best way to resolve it.
model = sm.tsa.arima.ARIMA(endo, exo, order =(1,0,0)).fit() # This works
model = ARIMA(endo, exo, order = (1,0,0)).fit() # This does not work
I am trying to fit an ARIMA model from the sktime package. I import a dataset and convert it to a pandas Series. Then I fit the model on the training sample, and the error occurs when I try to predict.
from sktime.forecasting.base import ForecastingHorizon
from sktime.forecasting.model_selection import temporal_train_test_split
from sktime.forecasting.arima import ARIMA
import numpy as np, pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/a10.csv',
p, d, q = 3, 1, 2
y_train, y_test = temporal_train_test_split(df, test_size=24)
model = ARIMA((p, d, q))
results = model.fit(y_train)
fh = ForecastingHorizon(y_test.index, is_relative=False,)
# the error is here !!
y_pred_vals, y_pred_int = results.predict(fh, return_pred_int=True)
The error message is the following:
ValueError: Invalid frequency. Please select a frequency that can be converted to a regular
`pd.PeriodIndex`. For other frequencies, basic arithmetic operation to compute durations
currently do not work reliably.
I tried to use .asfreq("M") while reading the dataset, however, all the values in the series become NaN.
What is interesting is that this code works with the default load_airline dataset from sktime.datasets but not with my dataset from github.
I get a different error: ValueError: ``unit`` missing, possibly due to a version difference. Either way, I'd say it is better to have your dataframe's index as a pd.PeriodIndex instead of a pd.DatetimeIndex. The former is, I think, more explicit (e.g. a monthly series has periods as its time steps, not exact dates) and works more smoothly. So, after reading the csv,
df.index = pd.PeriodIndex(df.index, freq="M")
should clear the error (it does in my version, 0.5.1).
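A small pandas-only sketch of that conversion, on made-up monthly values rather than the a10 data. It also hints at why .asfreq("M") produced NaN: on a DatetimeIndex that alias means month-end dates, and a series stamped on the first of each month has no observations on those dates, so the reindexed series comes back empty. Converting to periods relabels the existing points instead of reindexing them.

```python
import pandas as pd

# Hypothetical monthly series stamped on the first day of each month
dates = pd.date_range("1991-07-01", periods=6, freq="MS")
y = pd.Series([3.5, 3.2, 3.3, 3.6, 3.6, 4.3], index=dates, name="value")

# Relabel each timestamp as the month it falls in; values are untouched
y_period = y.copy()
y_period.index = pd.PeriodIndex(y.index, freq="M")

print(y_period.index)
```

The same result should be available through y.to_period("M"), which is shorthand for this relabelling.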
I've tried to use the imputer to replace all of the NaN entries in my dataset with the averages of their respective columns. For example, I want to fix a blank entry under the salary column by filling it with the average of the salary values in that column. I tried doing this by following along with a tutorial, but I think the video was outdated, resulting in this error.
# Data Processing
# Importing the Libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv("Data.csv")
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 3].values
# Taking care of Missing Data
from sklearn.preprocessing import Imputer
#The source of all the problems
imputer = Imputer(missing_values = 'NaN', strategy = 'mean', axis = 0)
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform
Initially, X looked like this when compiled prior to using Imputer:
However, once I ran lines 16-18, I got this error and I'm not sure what to do.
The line
X[:, 1:3] = imputer.transform
should be
X[:, 1:3] = imputer.transform(X[:, 1:3])
with parentheses, to actually call the method rather than assign the method itself to something.
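On the "outdated tutorial" point: as far as I know, sklearn.preprocessing.Imputer has been removed from recent scikit-learn releases in favour of sklearn.impute.SimpleImputer, which has no axis argument and always works column-wise. A minimal sketch of the same mean imputation with the newer class follows; the small array of ages and salaries is made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical age / salary columns with one gap in each
X = np.array([
    [44.0, 72000.0],
    [27.0, 48000.0],
    [30.0, np.nan],
    [np.nan, 61000.0],
])

# missing_values is np.nan here, not the string 'NaN'
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")

# fit_transform learns the column means and fills the gaps in one call
X_filled = imputer.fit_transform(X)
print(X_filled)
```

Each NaN is replaced by the mean of the non-missing values in its own column.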
Currently, I am getting this error in my code
'ValueError: Input contains NaN, infinity or a value too large for dtype('float64')'
when I want to run this code
import pandas as pd
features_col=['Num_comments', 'Num_Commits','Changed_files']
from sklearn.linear_model import LogisticRegression
If you need a sample of my dataset to check what is really happening, please let me know.
I've loaded the sample set, and the code below ran on my computer:
import pandas as pd
from sklearn.linear_model import LogisticRegression
train = pd.read_csv('RailsDataset_bis.csv')
features_col = ['Num_Comments', 'Num_Commits', 'Changed_files']
X = train[features_col].dropna()
y = train['class'].dropna()
logreg = LogisticRegression()
logreg.fit(X, y)
I've corrected issues such as:
There is no Num_comments column, only a Num_Comments column, as pandas is case-sensitive. The line X=train.loc[:,features_col] didn't give you an error, but generated a column full of NaN. Selecting columns like this, X = train[features_col], will throw an error if the column name doesn't exist.
There is no train.classes, as the column name is class and not classes.
There was a line full of NaN at the bottom of the set that needed to be removed with dropna().
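One caveat worth sketching: calling dropna() separately on X and y only keeps them aligned when the NaN values sit in the same rows, as they do with a single all-NaN line. A slightly more defensive variant drops rows once on the whole frame and then splits. The tiny frame below is made up and merely borrows the column names from the question.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical sample with an all-NaN row at the bottom
train = pd.DataFrame({
    "Num_Comments": [1, 5, 2, 8, 3, 9, np.nan],
    "Num_Commits": [1, 4, 1, 6, 2, 7, np.nan],
    "Changed_files": [2, 9, 3, 12, 4, 15, np.nan],
    "class": [0, 1, 0, 1, 0, 1, np.nan],
})
features_col = ["Num_Comments", "Num_Commits", "Changed_files"]

# Drop incomplete rows once, so features and target stay row-aligned
clean = train.dropna(subset=features_col + ["class"])
X = clean[features_col]
y = clean["class"].astype(int)

logreg = LogisticRegression()
logreg.fit(X, y)
predictions = logreg.predict(X)
```

Splitting after the drop means X and y always share the same index, whichever rows were removed.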