I am trying to impute missing values, not with zeros or means, but with ML predicted results. I'm testing my idea on the standard 'Titanic' dataset, which has around 80% of the age records filled in, but around 20% are missing. How can I fill in missing values with the predicted results from a simple linear regression model? Here is the code that I am testing.
import pandas as pd
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
data = pd.read_csv('C:\\Users\\ryans\\seaborn-data\\titanic.csv')
print(data)
list(data)
data.dtypes
data_with_null = data[['survived','pclass','sibsp','parch','fare','age']]
data_without_null = data_with_null.dropna()
train_data_x = data_without_null.iloc[:,:5]
train_data_y = data_without_null.iloc[:,5]
linreg.fit(train_data_x,train_data_y)
test_data = data_with_null.iloc[:,:5]
age = pd.DataFrame(linreg.predict(test_data))
# check for nulls
data_with_null.apply(lambda x: sum(x.isnull()),axis=0)
Everything works up to this point, but when I try to 'fillna', I get errors.
data_with_null.age.fillna(age,inplace=True)
The line of code directly above shows this error:
TypeError: "value" parameter must be a scalar, dict or Series, but you passed a "DataFrame"
data_with_null['age'].fillna(data_with_null[data_with_null['age'].isnull()].apply(axis=1),inplace=True)
Similarly, the line of code above shows this error:
TypeError: apply() missing 1 required positional argument: 'func'
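One possible way around both errors, sketched under assumptions: `fillna` accepts a scalar, dict or Series, so the predictions can be wrapped in a Series that carries the index of the rows being filled. The example below uses a small made-up frame instead of titanic.csv (all values are invented, and only three feature columns are kept), fits on the rows where age is known, and predicts only for the rows where it is missing.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up stand-in for the Titanic columns used in the question.
data = pd.DataFrame({
    "pclass": [1, 3, 2, 3, 1, 2, 3, 1],
    "sibsp":  [0, 1, 0, 2, 1, 0, 0, 1],
    "fare":   [71.3, 7.9, 13.0, 21.1, 53.1, 10.5, 8.1, 83.5],
    "age":    [38.0, 22.0, np.nan, 4.0, 35.0, np.nan, 27.0, 54.0],
})
features = ["pclass", "sibsp", "fare"]
original_age = data["age"].copy()

known = data["age"].notna()      # rows usable for training
missing = ~known                 # rows that need a prediction

linreg = LinearRegression()
linreg.fit(data.loc[known, features], data.loc[known, "age"])

# Predict only for the rows with missing age and keep their index,
# so each prediction lands on the row it was computed for.
predicted_age = pd.Series(
    linreg.predict(data.loc[missing, features]),
    index=data.index[missing],
)

# fillna accepts a Series; values are matched by index label.
data["age"] = data["age"].fillna(predicted_age)
print(data["age"].isna().sum())
```

Assigning the result back to the column, rather than calling `fillna(..., inplace=True)` on a slice of another frame, also sidesteps pandas' chained-assignment warnings.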
I am trying to do a simple linear regression. I've made a data frame in pandas and then typed:
df.dropna() # remove all rows with missing data
This doesn't remove any rows, so I assume there are no missing values.
Then I ran the following code to create NumPy arrays for 2 columns that I wish to run my linear regression between.
Amy_Dif_Num = df.iloc[:, 10].values.reshape(-1, 1)
Amy_Dif_Num.shape
Pain_Int_Num = df.iloc[:, 3].values.reshape(-1, 1) # values converts it into a numpy array
Pain_Int_Num.shape
Then when I run my regression, it errors that I have a missing value. What am I missing here? Please help.
X = Pain_Int_Num
Y = Amy_Dif_Num
np.isnan(X).any()
linear_regressor = LinearRegression() # create object for the class
linear_regressor.fit(X, Y) # perform linear regression
Y_pred = linear_regressor.predict(X) # make predictions
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
When I run
np.isnan(X).any() #this comes out True
How do I remove this null value whilst also continuing to have the same number of columns in each NumPy array? Please help.
Also, why didn't my df.dropna() code work earlier?
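A short sketch of what is probably going on: `df.dropna()` returns a new DataFrame and leaves `df` itself unchanged unless the result is assigned back. The frame below is invented, and the column names `pain` and `amy` are hypothetical stand-ins for the columns at positions 3 and 10 in the question. Dropping a row when either column is missing keeps the two arrays the same length and row-aligned.

```python
import numpy as np
import pandas as pd

# Hypothetical two-column frame; the question's real columns sit at positions 3 and 10.
df = pd.DataFrame({
    "pain": [1.0, 2.0, np.nan, 4.0, 5.0],
    "amy":  [2.1, 3.9, 6.2, np.nan, 10.1],
})

df.dropna()                      # returns a cleaned copy that is thrown away
rows_before = len(df)            # df itself is unchanged

# Keep the result, and drop a row if either regression column is missing,
# so X and Y stay the same length and row-aligned.
clean = df.dropna(subset=["pain", "amy"])

X = clean["pain"].to_numpy().reshape(-1, 1)
Y = clean["amy"].to_numpy().reshape(-1, 1)

print(rows_before, X.shape, Y.shape)
print(np.isnan(X).any(), np.isnan(Y).any())
```

Removing NaN values from each NumPy array separately would instead risk pairing an X value with the wrong Y value.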
I am fitting a simple ARIMAX(1,0,0) model with one dependent variable y and one independent variable x, a time series with 49 observations.
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
df = pd.read_excel('/Users/gaetanlion/Google Drive/Python/Arima/df.xlsx', sheet_name = 'final')
from statsmodels.tsa.arima_model import ARIMA
endo = df['y']
exo = df['x']
''' Doing a ARIMA(1,0,0) '''
model = ARIMA(endo, exo, order = (1,0,0)).fit()
When I run this simple model, I get the following error:
TypeError: __new__() got multiple values for argument 'order'
Ok, I was able to resolve this coding issue. But, I am not so sure it is the best way to resolve it.
model = sm.tsa.arima.ARIMA(endo, exo, order =(1,0,0)).fit() # This works
model = ARIMA(endo, exo, order = (1,0,0)).fit() # This does not work
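A likely explanation, sketched with stand-in functions rather than statsmodels itself: the legacy `statsmodels.tsa.arima_model.ARIMA` takes `order` as its second positional parameter, while the newer `statsmodels.tsa.arima.model.ARIMA` takes `exog` second. The two functions below are hypothetical stand-ins that copy only that parameter order, which is enough to reproduce the error message.

```python
def legacy_arima(endog, order, exog=None):
    """Stand-in mirroring the parameter order of the legacy ARIMA class."""
    return {"endog": endog, "order": order, "exog": exog}


def current_arima(endog, exog=None, order=(0, 0, 0)):
    """Stand-in mirroring the parameter order of the newer ARIMA class."""
    return {"endog": endog, "order": order, "exog": exog}


endo, exo = [1.0, 2.0, 3.0], [0.5, 0.7, 0.9]

try:
    # exo lands in the `order` slot, then order= is supplied a second time.
    legacy_arima(endo, exo, order=(1, 0, 0))
    legacy_error = None
except TypeError as err:
    legacy_error = str(err)

# Naming every argument after endog works with either parameter order.
legacy_ok = legacy_arima(endo, order=(1, 0, 0), exog=exo)
current_ok = current_arima(endo, exog=exo, order=(1, 0, 0))

print(legacy_error)
```

With the real library, the equivalent habit is to write `ARIMA(endo, exog=exo, order=(1, 0, 0))`. The legacy module was deprecated and later removed from statsmodels, so importing from the newer module is the safer choice in any case.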
I am trying to fit an ARIMA model from the sktime package. I import a dataset and convert it to a pandas Series. Then I fit the model on the training sample, and the error occurs when I try to predict.
from sktime.forecasting.base import ForecastingHorizon
from sktime.forecasting.model_selection import temporal_train_test_split
from sktime.forecasting.arima import ARIMA
import numpy as np, pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/a10.csv',
parse_dates=['date']).set_index('date').T.iloc[0]
p, d, q = 3, 1, 2
y_train, y_test = temporal_train_test_split(df, test_size=24)
model = ARIMA((p, d, q))
results = model.fit(y_train)
fh = ForecastingHorizon(y_test.index, is_relative=False,)
# the error is here !!
y_pred_vals, y_pred_int = results.predict(fh, return_pred_int=True)
The error message is the following:
ValueError: Invalid frequency. Please select a frequency that can be converted to a regular
`pd.PeriodIndex`. For other frequencies, basic arithmetic operation to compute durations
currently do not work reliably.
I tried to use .asfreq("M") while reading the dataset, however, all the values in the series become NaN.
Interestingly, this code works with the default load_airline dataset from sktime.datasets but not with my dataset from GitHub.
I get a different error (ValueError: ``unit`` missing), possibly because of a version difference. In any case, I'd say it is better to have your data's index as a pd.PeriodIndex instead of a pd.DatetimeIndex. I think the former is more explicit (a monthly series has periods as its time steps, not exact dates) and works more smoothly. So after reading the CSV,
df.index = pd.PeriodIndex(df.index, freq="M")
should clear the error (it does in my version, 0.5.1).
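To illustrate the difference, here is a minimal sketch on a made-up monthly series. It assumes the CSV's dates fall on the first day of each month, which would explain why converting to a month-end frequency produced only NaN: none of the month-end timestamps exist in the original index. Relabelling the same rows as monthly periods keeps every value.

```python
import numpy as np
import pandas as pd

# Made-up monthly series stamped on the first day of each month,
# assumed to resemble the layout of the CSV in the question.
dates = pd.date_range("1991-07-01", periods=6, freq="MS")
y = pd.Series(np.arange(6, dtype=float), index=dates)

# Reindexing to month-end timestamps finds no matching dates, hence all NaN.
month_end = y.asfreq(pd.offsets.MonthEnd())

# Relabelling the same rows as monthly periods keeps every value.
y_period = y.copy()
y_period.index = pd.PeriodIndex(y_period.index, freq="M")

print(month_end.isna().all(), y_period.isna().any())
print(y_period.index)
```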
I've tried to use the imputer to replace all of the NaN entries in my dataset with the averages of their respective columns. For example, I want a blank entry in the salary column to be filled with the average of the salary values in that column. I tried doing this by following along with a tutorial, but I think the video was outdated, resulting in this error.
Code:
# Data Processing
# Importing the Libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv("Data.csv")
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 3].values
# Taking care of Missing Data
from sklearn.preprocessing import Imputer
#The source of all the problems
imputer = Imputer(missing_values = 'NaN', strategy = 'mean', axis = 0)
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform
Initially, X looked like this when compiled prior to using Imputer:
However, once I ran lines 16-18, I got this error, and I'm not sure what to do.
The line
X[:, 1:3] = imputer.transform
should be
X[:, 1:3] = imputer.transform(X[:, 1:3])
with parentheses (and the data to transform as the argument), so that the method is actually called rather than the method object itself being assigned.
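Since the tutorial seems to be outdated, note that recent scikit-learn releases no longer provide `sklearn.preprocessing.Imputer`. Its replacement is `sklearn.impute.SimpleImputer`, which expects `np.nan` rather than the string 'NaN' and has no `axis` parameter (it always works column by column). A minimal sketch on a made-up array standing in for `X[:, 1:3]`:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up stand-in for X[:, 1:3] (for example, an age and a salary column).
X_num = np.array([
    [44.0, 72000.0],
    [27.0, np.nan],
    [np.nan, 54000.0],
    [38.0, 61000.0],
])

# Replace each NaN with the mean of the non-missing values in its column.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X_filled = imputer.fit_transform(X_num)

print(X_filled)
```

`fit_transform` combines the separate `fit` and `transform` calls from the question into one step.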
I am currently getting this error:
ValueError: Input contains NaN, infinity or a value too large for dtype('float64')
when I run this code:
import pandas as pd
train=pd.read_csv(r'C:\Users\ABDILLAH\Desktop\datasets\Rails\RailsDataset.csv')  # raw string so the backslashes are not treated as escapes
features_col=['Num_comments', 'Num_Commits','Changed_files']
X=train.loc[:,features_col]
y=train.classes
from sklearn.linear_model import LogisticRegression
logreg=LogisticRegression()
logreg.fit(X,y)
If you need a sample of my dataset to check what really happened, please let me know.
I've loaded the sample set, and the code below ran on my computer:
import pandas as pd
from sklearn.linear_model import LogisticRegression
train = pd.read_csv('RailsDataset_bis.csv')
features_col = ['Num_Comments', 'Num_Commits', 'Changed_files']
X = train[features_col].dropna()
y = train['class'].dropna()
logreg = LogisticRegression()
logreg.fit(X, y)
I have corrected the following issues:
There is no Num_comments column, only a Num_Comments column, because pandas column names are case-sensitive. The line X=train.loc[:,features_col] didn't give you an error, but generated a column full of NaN. Selecting columns with X = train[features_col] will throw an error if a column name doesn't exist.
There is no train.classes, because the column name is class and not classes.
There was a row full of NaN at the bottom of the set that needed to be removed with dropna().
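One caveat, sketched on made-up data: calling dropna() on X and y separately works when the only incomplete row is empty in every column, but if a feature is missing in a row where the class is present, X and y end up with different lengths. Dropping incomplete rows once, before splitting, avoids that. The frame below is invented; only the column names are taken from the answer above.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the Rails dataset, including a trailing all-NaN row.
train = pd.DataFrame({
    "Num_Comments":  [3, 0, 7, np.nan, np.nan],
    "Num_Commits":   [1, 2, 5, 4, np.nan],
    "Changed_files": [2, 1, 9, 3, np.nan],
    "class":         [0, 0, 1, 1, np.nan],
})

features_col = ["Num_Comments", "Num_Commits", "Changed_files"]
target_col = "class"

# Fail early, with a readable message, if a name is misspelled.
missing_cols = [c for c in features_col + [target_col] if c not in train.columns]
if missing_cols:
    raise KeyError(f"columns not found: {missing_cols}")

# Drop incomplete rows once, then split, so X and y stay aligned.
clean = train.dropna(subset=features_col + [target_col])
X = clean[features_col]
y = clean[target_col]

print(X.shape, y.shape)
```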